Preprint · 2026

MementoGUI: Learning Agentic Multimodal Memory Control
for Long-Horizon GUI Agents

Ziyun Zeng1,*   Hang Hua2,*,†   Bocheng Zou3   Mu Cai3   Rogerio Feris2   Jiebo Luo1

1University of Rochester  ·  2MIT-IBM Watson AI Lab  ·  3University of Wisconsin-Madison

† Project Lead    * Equal Contribution

MementoGUI framework overview

MementoGUI augments a frozen GUI action backbone with multimodal working and episodic memory. MementoCore updates, retrieves, and writes memory, then serializes textual summaries and ROI references as multimodal context for GUI action prediction.

TL;DR

Long-horizon GUI control is bottlenecked not by single-step perception, but by active management of long-term multimodal state. MementoGUI reframes this as an online memory-control problem: a plug-in controller, MementoCore, decides when to update memory, what to preserve, how to compress, and when to retrieve — without finetuning the underlying GUI backbone.

Frozen GUI backbone, no action-model finetuning
4 LoRA adapters on a shared Qwen3-VL
Working & episodic memory, jointly controlled
ROI-level visual evidence, not just text summaries
Plug-in for open-source and proprietary MLLMs

Abstract

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.

MementoCore: Four Memory-Control Operators

MementoCore attaches four task-specific LoRA adapters to a shared frozen Qwen3-VL backbone. Each adapter plays one well-defined role in the online memory loop; together they turn raw interaction history into decision-oriented multimodal context.

LoRA 1 · Step Processor
Scores each new frame for write-salience, produces an event summary and an ROI bounding box, and flags whether episodic retrieval should fire.

LoRA 2 · WM Compressor
Consolidates older working-memory entries into compact summaries while preserving representative visual identifiers for later grounding.

LoRA 3 · Episodic Writer
Converts a completed trajectory into a reusable episodic entry: summary, outcome metadata, retrieval embeddings, and representative ROI crops.

LoRA 4 · Episodic Selector
Filters coarse vector-retrieved candidates by multimodal relevance to the current task state before they are injected into the backbone prompt.
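
The interplay between the Step Processor and the WM Compressor can be sketched as a small online loop. This is a minimal illustration under stated assumptions, not the released implementation: the names `StepOutput`, `MemoryEntry`, and `WorkingMemory`, the salience threshold, the entry budget, and the "merge the older half" rule are all hypothetical, and the learned adapters are replaced by hand-written stand-ins.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class StepOutput:
    """Hypothetical per-frame output of the Step Processor."""
    salience: float                            # write-salience score in [0, 1]
    summary: str                               # textual event summary
    roi: Optional[Tuple[int, int, int, int]]   # ROI box (x1, y1, x2, y2), if any
    retrieve: bool                             # should episodic retrieval fire?


@dataclass
class MemoryEntry:
    summary: str
    roi_ids: List[str] = field(default_factory=list)


@dataclass
class WorkingMemory:
    entries: List[MemoryEntry] = field(default_factory=list)
    budget: int = 4                # assumed maximum number of entries
    write_threshold: float = 0.5   # assumed salience cut-off

    def write(self, step_id: int, out: StepOutput) -> bool:
        """Store a step only if it is salient; compress when over budget."""
        if out.salience < self.write_threshold:
            return False
        roi_ids = [f"roi_{step_id}"] if out.roi is not None else []
        self.entries.append(MemoryEntry(out.summary, roi_ids))
        if len(self.entries) > self.budget:
            self.compress()
        return True

    def compress(self) -> None:
        """Stand-in for the WM Compressor: merge the older half into one
        entry and keep a single representative ROI identifier."""
        cut = len(self.entries) // 2
        old, recent = self.entries[:cut], self.entries[cut:]
        merged = " ; ".join(e.summary for e in old)
        kept_roi = [rid for e in old for rid in e.roi_ids][:1]
        self.entries = [MemoryEntry(merged, kept_roi)] + recent

    def serialize(self) -> str:
        """Render summaries plus ROI references as prompt context."""
        lines = []
        for e in self.entries:
            refs = " ".join(f"<{rid}>" for rid in e.roi_ids)
            lines.append(f"- {e.summary} {refs}".rstrip())
        return "\n".join(lines)


wm = WorkingMemory(budget=3)
steps = [
    StepOutput(0.90, "opened Settings", (10, 10, 50, 50), False),
    StepOutput(0.10, "scrolled", None, False),           # below threshold
    StepOutput(0.80, "enabled Wi-Fi", (5, 5, 20, 20), False),
    StepOutput(0.70, "typed password", None, True),
    StepOutput(0.95, "clicked Connect", (0, 0, 8, 8), False),
]
for i, s in enumerate(steps):
    wm.write(i, s)
print(wm.serialize())
```

In this toy run the low-salience scroll is never written, and the fifth step pushes the memory over its budget, so the two oldest entries are merged while one ROI identifier is retained for later grounding.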

Data Curation Pipeline

We automatically convert raw computer-use videos into structured supervision for all four operators. The pipeline parses each trajectory into frame-level and subgoal-level annotations, assembles SFT data for step processing, memory compression, episodic writing, and episodic selection, and finally constructs DPO preference pairs for the two online operators (step processing and memory compression) via rule-based corruption and VLM-judged filtering. In a human validation of 200 trajectories, 197 (98.5%) are judged fully correct.

MementoGUI data curation pipeline

Figure: (A) Raw computer-use videos are parsed into hierarchical frame- and subgoal-level annotations. (B) These annotations are converted into SFT data for four MementoCore operators. (C) Step-processing and memory-compression samples are further corrupted and VLM-filtered to form DPO training pairs.
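
Stage (C) can be illustrated with a short sketch of how a reference sample might be corrupted into a rejected sample and then filtered. The specific corruption rules (`drop_roi`, `flip_retrieve`, `truncate_summary`), the sample fields, and the `judge` callback are assumptions made for illustration; the page does not specify the actual rules or the VLM judge prompt.

```python
import copy
import random
from typing import Callable, Dict, Optional

Sample = Dict[str, object]


def drop_roi(sample: Sample) -> None:
    """Hypothetical rule: discard the localized visual evidence."""
    sample["roi"] = None


def flip_retrieve(sample: Sample) -> None:
    """Hypothetical rule: invert the episodic-retrieval flag."""
    sample["retrieve"] = not sample["retrieve"]


def truncate_summary(sample: Sample) -> None:
    """Hypothetical rule: keep only the first half of the summary."""
    words = str(sample["summary"]).split()
    sample["summary"] = " ".join(words[: max(1, len(words) // 2)])


RULES = [drop_roi, flip_retrieve, truncate_summary]


def make_dpo_pair(
    chosen: Sample,
    rng: random.Random,
    judge: Callable[[Sample, Sample], bool],
) -> Optional[Dict[str, object]]:
    """Corrupt a copy of the reference output, then keep the pair only if
    the corruption changed something and the judge accepts it."""
    rule = rng.choice(RULES)
    rejected = copy.deepcopy(chosen)
    rule(rejected)
    if rejected == chosen:            # corruption was a no-op
        return None
    if not judge(chosen, rejected):   # stand-in for VLM-judged filtering
        return None
    return {"chosen": chosen, "rejected": rejected, "rule": rule.__name__}


reference = {
    "summary": "clicked the Save button in the toolbar",
    "roi": [120, 40, 180, 70],
    "retrieve": False,
}
# The judge here accepts everything; a real filter would query a VLM.
pair = make_dpo_pair(reference, random.Random(0), judge=lambda c, r: True)
print(pair["rule"], pair["rejected"])
```

Deep-copying before corruption keeps the reference sample intact, so the same annotation can be reused to generate several rejected variants.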

Main Results

We evaluate four frozen open-source GUI backbones and two proprietary generalist MLLM agents on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench. MementoGUI consistently improves over no-history, raw-history, and text-only memory baselines across all backbones.

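| Backbone | Method | GUI-Odyssey AMS ↑ | GUI-Odyssey Traj. SR ↑ | MM-Mind2Web Step SR ↑ | MementoGUI-Bench VAM ↑ | MementoGUI-Bench TPS ↑ | MementoGUI-Bench MCS ↑ |
|---|---|---|---|---|---|---|---|
| Proprietary generalist MLLM agents | | | | | | | |
| GPT-5.5 | Direct Prompting | 54.46 | 2.02 | 18.81 | 1.95 | 2.00 | 2.86 |
| Gemini-3.1-Pro | Direct Prompting | 60.62 | 1.81 | 22.97 | 2.18 | 2.67 | 2.75 |
| Open-source frozen GUI backbones | | | | | | | |
| UI-Venus-1.5-8B | No History | 54.58 | 1.29 | 5.29 | 1.80 | 2.00 | 0.00 |
| | Pred. Hist. All | 66.31 | 2.33 | 7.57 | 1.58 | 2.29 | 5.36 |
| | Text Summary Memory | 62.18 | 2.12 | 11.66 | 1.58 | 3.38 | 3.08 |
| | Working Memory | 67.69 | 2.69 | 11.80 | 1.41 | 4.63 | 7.00 |
| | + Episodic Memory | 68.32 | 3.57 | 12.60 | 1.67 | 5.16 | 7.14 |
| MAI-UI-8B | No History | 35.70 | 0.36 | 12.53 | 1.81 | 2.00 | 0.00 |
| | Pred. Hist. All | 44.79 | 1.35 | 13.44 | 1.54 | 2.11 | 7.55 |
| | Text Summary Memory | 38.33 | 0.62 | 13.28 | 1.63 | 3.00 | 3.23 |
| | Working Memory | 49.08 | 1.97 | 14.61 | 1.76 | 5.13 | 8.18 |
| | + Episodic Memory | 49.31 | 2.12 | 14.67 | 1.88 | 5.36 | 8.13 |
| GUI-Owl-1.5-8B | No History | 40.15 | 0.16 | 17.78 | 1.60 | 2.00 | 0.00 |
| | Pred. Hist. All | 38.88 | 0.47 | 13.56 | 1.04 | 2.00 | 5.25 |
| | Text Summary Memory | 45.45 | 0.62 | 14.14 | 1.51 | 3.05 | 3.10 |
| | Working Memory | 48.25 | 1.40 | 15.53 | 1.79 | 3.38 | 7.39 |
| | + Episodic Memory | 49.45 | 1.71 | 16.42 | 1.82 | 4.18 | 7.73 |
| GUI-Owl-1.5-32B | No History | 45.73 | 0.57 | 18.98 | 2.10 | 2.00 | 0.00 |
| | Pred. Hist. All | 49.02 | 1.55 | 16.48 | 1.83 | 2.11 | 4.73 |
| | Text Summary Memory | 47.21 | 0.72 | 17.00 | 2.37 | 3.40 | 3.40 |
| | Working Memory | 51.36 | 2.33 | 17.71 | 2.50 | 4.24 | 7.85 |
| | + Episodic Memory | 55.17 | 2.59 | 19.12 | 2.89 | 4.31 | 8.30 |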

Metrics: AMS and trajectory success (%) on GUI-Odyssey, Step SR (%) on MM-Mind2Web, and VAM / TPS / MCS on MementoGUI-Bench (Gemini-3.1-Pro as judge).
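
As a reading aid for the results table, the snippet below computes the absolute change from the No History row to the full working + episodic memory row for each open-source backbone. The values are copied from the table; the dictionary layout and the `gains` helper are illustrative only.

```python
from typing import Dict, Tuple

METRICS = ("AMS", "Traj. SR", "Step SR", "VAM", "TPS", "MCS")
Row = Tuple[float, ...]

# (No History, Working + Episodic Memory) rows from the results table.
RESULTS: Dict[str, Dict[str, Row]] = {
    "UI-Venus-1.5-8B": {
        "no_history": (54.58, 1.29, 5.29, 1.80, 2.00, 0.00),
        "full": (68.32, 3.57, 12.60, 1.67, 5.16, 7.14),
    },
    "MAI-UI-8B": {
        "no_history": (35.70, 0.36, 12.53, 1.81, 2.00, 0.00),
        "full": (49.31, 2.12, 14.67, 1.88, 5.36, 8.13),
    },
    "GUI-Owl-1.5-8B": {
        "no_history": (40.15, 0.16, 17.78, 1.60, 2.00, 0.00),
        "full": (49.45, 1.71, 16.42, 1.82, 4.18, 7.73),
    },
    "GUI-Owl-1.5-32B": {
        "no_history": (45.73, 0.57, 18.98, 2.10, 2.00, 0.00),
        "full": (55.17, 2.59, 19.12, 2.89, 4.31, 8.30),
    },
}


def gains(results: Dict[str, Dict[str, Row]]) -> Dict[str, Dict[str, float]]:
    """Absolute difference (full minus no-history) per backbone and metric."""
    out: Dict[str, Dict[str, float]] = {}
    for backbone, rows in results.items():
        base, full = rows["no_history"], rows["full"]
        out[backbone] = {
            m: round(f - b, 2) for m, b, f in zip(METRICS, base, full)
        }
    return out


for backbone, delta in gains(RESULTS).items():
    print(backbone, delta)
```

Read this way, AMS, trajectory success, TPS, and MCS rise for every open-source backbone. The two exceptions are Step SR on GUI-Owl-1.5-8B and VAM on UI-Venus-1.5-8B, which sit slightly below their No History rows.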

GUI-Odyssey performance by trajectory length

Figure: GUI-Odyssey performance on UI-Venus-1.5-8B, broken out by trajectory-length bins. Gains from working + episodic memory grow with trajectory length; raw history and text-only memory struggle most on the longest trajectories.

Effect of episodic memory bank size

Figure: Effect of episodic-memory bank size on GUI-Odyssey across frozen backbones. Larger banks generally improve trajectory success, indicating that reusable experience mainly benefits long-horizon completion.

Ablations

Visual grounding matters in working memory

Removing ROI reference images from the memory context (keeping the same learned memory controller) consistently hurts AMS, VAM, and MCS on both UI-Venus-1.5-8B and GUI-Owl-1.5-8B. Localized visual evidence is not redundant with textual summaries.

Learned episodic selection beats raw retrieval

Random episodic context and single-stage embedding retrieval both trail the full two-stage (retrieve + learned-select) pipeline. MementoGUI benefits from both ROI-level grounding and filtered episodic experience, rather than just adding more context.
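
The two-stage pipeline compared in this ablation can be sketched as coarse embedding retrieval followed by a relevance filter. This is a minimal sketch under assumptions: the toy two-dimensional embeddings, the `retrieve_then_select` name, and the keyword-based `selector` are hypothetical stand-ins; in MementoGUI the second stage is the learned Episodic Selector, which judges multimodal relevance to the current task state.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_then_select(
    query_vec: Sequence[float],
    bank: List[Tuple[str, Sequence[float]]],
    selector: Callable[[str], bool],
    k: int = 3,
) -> List[str]:
    """Stage 1: take the top-k entries by embedding similarity.
    Stage 2: keep only the candidates the selector accepts."""
    ranked = sorted(bank, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    candidates = [text for text, _ in ranked[:k]]
    return [c for c in candidates if selector(c)]


# Toy episodic bank: (entry summary, embedding).
bank = [
    ("book a flight on an airline site", [1.0, 0.0]),
    ("change Wi-Fi settings", [0.0, 1.0]),
    ("book a hotel room", [0.9, 0.1]),
    ("export spreadsheet to PDF", [0.1, 0.9]),
]

# Stand-in selector: accept only entries that mention flights.
selected = retrieve_then_select(
    query_vec=[1.0, 0.2],
    bank=bank,
    selector=lambda text: "flight" in text,
    k=2,
)
print(selected)
```

In this toy example the hotel entry ranks highest by embedding similarity, yet it is dropped by the second stage. That is the failure mode of single-stage retrieval the ablation points to: nearby in embedding space, but not relevant to the current task.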

Scaling MementoCore

With the memory architecture and frozen GUI backbone fixed, we scale the memory controller from 2B → 4B → 8B. Larger controllers generally improve long-horizon decision support, with the 8B variant reaching the strongest AMS for both UI-Venus-1.5-8B and GUI-Owl-1.5-8B on GUI-Odyssey, and the strongest VAM for GUI-Owl-1.5-8B on MementoGUI-Bench. MementoGUI therefore scales as a plug-in memory layer — one can replace the controller with a stronger variant without touching the action model.

MementoGUI-Bench

MementoGUI-Bench is an offline benchmark for memory-dependent long-horizon GUI decision making, derived from PSAI computer-use videos. It contains 200 trajectories and 6,953 steps (average 34.8 steps per trajectory). 80 trajectories are used for testing and 120 for accumulating episodic memory. The benchmark focuses on cases where the next action depends on accumulated state: delayed constraints, completed subgoals, or prior experience.

Alongside standard reference-based scores, we report three VLM-judged, memory-aware metrics:

VAM (VLM-based Action Match): is the predicted action semantically equivalent to the reference action on the current screenshot?

TPS (Task Progress Score): does the predicted sequence move the task forward, without loops, regressions, or stalling?

MCS (Memory Consistency Score): does the memory state evolve consistently with task progress, user constraints, and retrieved episodic experience?

Difficulty × category composition

Benchmark composition. 200 trajectories stratified by category (Browser / Computer) and difficulty (Easy / Medium / Hard).

Action frames per trajectory

Trajectory length. Distribution of action frames per trajectory across the 200-trajectory pool.

Top-15 training apps

Training coverage — head. Top-15 applications in the memory-controller training pool.

Training app long tail

Training coverage — long tail. Word cloud over the remaining 429 applications in the training pool.

MementoCore Working Visualization

To give a concrete sense of how the four adapters interact at inference time, our supplementary visualization traces one pass through the pipeline on a real trajectory: a frame is scored and ROI-localised by the Step Processor, tail-step entries are collapsed by the WM Compressor, a completed trajectory is committed by the Episodic Writer, and retrieved candidates are re-ranked by the Episodic Selector before the merged context is handed to the action backbone. The full per-stage prompts, multi-image inputs, and raw JSON outputs — for every stage and the backbone — are provided there.

BibTeX

@article{zeng2026mementogui,
  title   = {MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents},
  author  = {Zeng, Ziyun and Hua, Hang and Zou, Bocheng and Cai, Mu and Feris, Rogerio and Luo, Jiebo},
  journal = {arXiv preprint},
  year    = {2026}
}