¹University of Rochester · ²MIT-IBM Watson AI Lab · ³University of Wisconsin-Madison
† Project Lead * Equal Contribution
Long-horizon GUI control is bottlenecked not by single-step perception, but by active management of long-term multimodal state. MementoGUI reframes this as an online memory-control problem: a plug-in controller, MementoCore, decides when to update memory, what to preserve, how to compress, and when to retrieve — without finetuning the underlying GUI backbone.
Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.
MementoCore attaches four task-specific LoRA adapters to a shared frozen Qwen3-VL backbone. Each adapter plays one well-defined role in the online memory loop; together they turn raw interaction history into decision-oriented multimodal context.
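As a rough illustration of the "four adapters, one frozen backbone" design, the sketch below routes each memory operation through a shared model by activating one adapter at a time. The class names, the adapter names, and the `set_adapter` / `generate` methods are hypothetical stand-ins, and the backbone is a stub rather than Qwen3-VL; only the routing pattern is the point.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical adapter names, one per memory operator described above.
OPERATORS = (
    "step_processing",
    "memory_compression",
    "episodic_writing",
    "episodic_selection",
)


@dataclass
class StubBackbone:
    """Stand-in for a frozen multimodal backbone with swappable adapters."""
    active_adapter: Optional[str] = None
    calls: List[Tuple[Optional[str], str]] = field(default_factory=list)

    def set_adapter(self, name: str) -> None:
        # Only the active adapter changes; the base weights are never touched.
        self.active_adapter = name

    def generate(self, prompt: str) -> str:
        # A real model would decode text here; the stub just logs the call.
        self.calls.append((self.active_adapter, prompt))
        return f"[{self.active_adapter}] {prompt}"


class MemoryControllerSketch:
    """Routes each memory operation to its own adapter on a shared backbone."""

    def __init__(self, backbone: StubBackbone) -> None:
        self.backbone = backbone

    def run(self, operator: str, prompt: str) -> str:
        if operator not in OPERATORS:
            raise ValueError(f"unknown operator: {operator}")
        self.backbone.set_adapter(operator)
        return self.backbone.generate(prompt)


controller = MemoryControllerSketch(StubBackbone())
out = controller.run("step_processing", "summarise current frame")
```

Because every operator shares the same base weights, adding or swapping an operator in this sketch only means registering another adapter name, which mirrors the plug-in framing of the text.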
We automatically convert raw computer-use videos into structured supervision for all four operators. The pipeline parses each trajectory into frame-level and subgoal-level annotations, assembles SFT data for step processing, memory compression, episodic writing, and episodic selection, and finally constructs DPO preference pairs for the two online operators via rule-based corruption and VLM-judged filtering. In a human validation of 200 trajectories, 197 (98.5%) were judged fully correct.
Figure: (A) Raw computer-use videos are parsed into hierarchical frame- and subgoal-level annotations. (B) These annotations are converted into SFT data for four MementoCore operators. (C) Step-processing and memory-compression samples are further corrupted and VLM-filtered to form DPO training pairs.
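Stage (C) can be pictured with a small sketch: a good operator output is corrupted by a rule to produce the rejected side of a preference pair, and a judge decides whether the pair is kept. The corruption rules, the sample fields (`summary`, `keep`, `roi`), and the `stub_judge` function are illustrative assumptions, not the rules or the VLM judge used in the actual pipeline; the `prompt` / `chosen` / `rejected` keys follow a common convention for preference data.

```python
import copy
import random


def corrupt(sample: dict, rng: random.Random) -> dict:
    """Apply one hypothetical rule-based corruption to a good output."""
    bad = copy.deepcopy(sample)  # never mutate the chosen sample
    rule = rng.choice(["drop_roi", "flip_keep", "truncate_summary"])
    if rule == "drop_roi":
        bad["roi"] = None
    elif rule == "flip_keep":
        bad["keep"] = not bad["keep"]
    else:
        bad["summary"] = bad["summary"][: len(bad["summary"]) // 2]
    return bad


def stub_judge(prompt: str, chosen: dict, rejected: dict) -> bool:
    """Placeholder for VLM-judged filtering: accept only pairs that differ."""
    return chosen != rejected


def build_dpo_pair(prompt, chosen, judge, rng):
    """Return a preference pair, or None if the judge filters it out."""
    rejected = corrupt(chosen, rng)
    if not judge(prompt, chosen, rejected):
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


good_output = {
    "summary": "Clicked the Save button in the toolbar",
    "keep": True,
    "roi": [10, 20, 110, 60],
}
pair = build_dpo_pair("Process step 7", good_output, stub_judge, random.Random(0))
```

In practice the judge would be a model call that confirms the corrupted output is clearly worse, so that near-equivalent pairs do not add noise to preference training.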
We evaluate four frozen open-source GUI backbones and two proprietary generalist MLLM agents on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench. MementoGUI consistently improves over no-history, raw-history, and text-only memory baselines across all backbones.
| Backbone | Method | GUI-Odyssey AMS ↑ | GUI-Odyssey Traj. SR ↑ | MM-Mind2Web Step SR ↑ | MementoGUI-Bench VAM ↑ | MementoGUI-Bench TPS ↑ | MementoGUI-Bench MCS ↑ |
|---|---|---|---|---|---|---|---|
| Proprietary generalist MLLM agents | | | | | | | |
| GPT-5.5 | Direct Prompting | 54.46 | 2.02 | 18.81 | 1.95 | 2.00 | 2.86 |
| Gemini-3.1-Pro | Direct Prompting | 60.62 | 1.81 | 22.97 | 2.18 | 2.67 | 2.75 |
| Open-source frozen GUI backbones | | | | | | | |
| UI-Venus-1.5-8B | No History | 54.58 | 1.29 | 5.29 | 1.80 | 2.00 | 0.00 |
| | Pred. Hist. All | 66.31 | 2.33 | 7.57 | 1.58 | 2.29 | 5.36 |
| | Text Summary Memory | 62.18 | 2.12 | 11.66 | 1.58 | 3.38 | 3.08 |
| | Working Memory | 67.69 | 2.69 | 11.80 | 1.41 | 4.63 | 7.00 |
| | + Episodic Memory | 68.32 | 3.57 | 12.60 | 1.67 | 5.16 | 7.14 |
| MAI-UI-8B | No History | 35.70 | 0.36 | 12.53 | 1.81 | 2.00 | 0.00 |
| | Pred. Hist. All | 44.79 | 1.35 | 13.44 | 1.54 | 2.11 | 7.55 |
| | Text Summary Memory | 38.33 | 0.62 | 13.28 | 1.63 | 3.00 | 3.23 |
| | Working Memory | 49.08 | 1.97 | 14.61 | 1.76 | 5.13 | 8.18 |
| | + Episodic Memory | 49.31 | 2.12 | 14.67 | 1.88 | 5.36 | 8.13 |
| GUI-Owl-1.5-8B | No History | 40.15 | 0.16 | 17.78 | 1.60 | 2.00 | 0.00 |
| | Pred. Hist. All | 38.88 | 0.47 | 13.56 | 1.04 | 2.00 | 5.25 |
| | Text Summary Memory | 45.45 | 0.62 | 14.14 | 1.51 | 3.05 | 3.10 |
| | Working Memory | 48.25 | 1.40 | 15.53 | 1.79 | 3.38 | 7.39 |
| | + Episodic Memory | 49.45 | 1.71 | 16.42 | 1.82 | 4.18 | 7.73 |
| GUI-Owl-1.5-32B | No History | 45.73 | 0.57 | 18.98 | 2.10 | 2.00 | 0.00 |
| | Pred. Hist. All | 49.02 | 1.55 | 16.48 | 1.83 | 2.11 | 4.73 |
| | Text Summary Memory | 47.21 | 0.72 | 17.00 | 2.37 | 3.40 | 3.40 |
| | Working Memory | 51.36 | 2.33 | 17.71 | 2.50 | 4.24 | 7.85 |
| | + Episodic Memory | 55.17 | 2.59 | 19.12 | 2.89 | 4.31 | 8.30 |
Metrics: AMS and trajectory success (%) on GUI-Odyssey, Step SR (%) on MM-Mind2Web, and VAM / TPS / MCS on MementoGUI-Bench (Gemini-3.1-Pro as judge).
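To read the table as gains rather than absolute scores, the short calculation below takes the GUI-Odyssey AMS values for the "No History" and "+ Episodic Memory" rows of each open-source backbone and reports the difference in percentage points. The numbers are copied from the table; the variable names are only for illustration.

```python
# GUI-Odyssey AMS (%) per backbone: (No History, + Episodic Memory).
ams = {
    "UI-Venus-1.5-8B": (54.58, 68.32),
    "MAI-UI-8B": (35.70, 49.31),
    "GUI-Owl-1.5-8B": (40.15, 49.45),
    "GUI-Owl-1.5-32B": (45.73, 55.17),
}

# Absolute gain in percentage points, rounded to two decimals.
gains = {
    name: round(with_memory - no_history, 2)
    for name, (no_history, with_memory) in ams.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} AMS points")
```

On these four backbones the AMS gain over the no-history setting ranges from 9.30 to 13.74 points.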
Figure: GUI-Odyssey performance on UI-Venus-1.5-8B, broken out by trajectory-length bins. Gains from working + episodic memory grow with trajectory length, where raw history and text-only memory struggle the most.
Figure: Effect of episodic-memory bank size on GUI-Odyssey across frozen backbones. Larger banks generally improve trajectory success, indicating that reusable experience mainly benefits long-horizon completion.
Removing ROI reference images from the memory context (keeping the same learned memory controller) consistently hurts AMS, VAM, and MCS on both UI-Venus-1.5-8B and GUI-Owl-1.5-8B. Localized visual evidence is not redundant with textual summaries.
Random episodic context and single-stage embedding retrieval both trail the full two-stage (retrieve + learned-select) pipeline. MementoGUI benefits from both ROI-level grounding and filtered episodic experience, rather than just adding more context.
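The two-stage idea (retrieve by embedding, then filter with a selector) can be sketched as follows. This is a minimal illustration under stated assumptions: the toy vectors, the record fields, and `keyword_selector` are hypothetical, and the keyword rule merely stands in for the learned selector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, bank, k):
    """Stage 1: top-k episodic records by embedding similarity."""
    ranked = sorted(bank, key=lambda rec: cosine(query_vec, rec["vec"]), reverse=True)
    return ranked[:k]


def select(task, candidates, selector):
    """Stage 2: keep only the candidates the selector judges relevant."""
    return [rec for rec in candidates if selector(task, rec)]


def keyword_selector(task, record):
    # Placeholder for the learned selector: require at least one shared word.
    return bool(set(task.lower().split()) & set(record["task"].lower().split()))


bank = [
    {"task": "export spreadsheet as pdf", "vec": [0.9, 0.1, 0.0]},
    {"task": "rename folder in file manager", "vec": [0.8, 0.2, 0.1]},
    {"task": "change desktop wallpaper", "vec": [0.0, 0.1, 0.9]},
]
query_task = "export report as pdf"
query_vec = [1.0, 0.0, 0.0]

candidates = retrieve(query_vec, bank, k=2)
selected = select(query_task, candidates, keyword_selector)
```

Here the embedding stage returns two nearby records, but only the PDF-export one survives selection, which shows how similarity alone can admit context that a second filtering stage removes.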
With the memory architecture and frozen GUI backbone fixed, we scale the memory controller from 2B → 4B → 8B. Larger controllers generally improve long-horizon decision support, with the 8B variant reaching the strongest AMS for both UI-Venus-1.5-8B and GUI-Owl-1.5-8B on GUI-Odyssey, and the strongest VAM for GUI-Owl-1.5-8B on MementoGUI-Bench. MementoGUI therefore scales as a plug-in memory layer — one can replace the controller with a stronger variant without touching the action model.
MementoGUI-Bench is an offline benchmark for memory-dependent long-horizon GUI decision making, derived from PSAI computer-use videos. It contains 200 trajectories and 6,953 steps (average 34.8 steps per trajectory). 80 trajectories are used for testing and 120 for accumulating episodic memory. The benchmark focuses on cases where the next action depends on accumulated state: delayed constraints, completed subgoals, or prior experience.
Alongside standard reference-based scores, we report three VLM-judged, memory-aware metrics, VAM, TPS, and MCS, which cover semantic action matching, task progress, and memory consistency.
Figure: Benchmark composition. 200 trajectories stratified by category (Browser / Computer) and difficulty (Easy / Medium / Hard).
Figure: Trajectory length. Distribution of action frames per trajectory across the 200-trajectory pool.
Figure: Training coverage, head. Top-15 applications in the memory-controller training pool.
Figure: Training coverage, long tail. Word cloud over the remaining 429 applications in the training pool.
To give a concrete sense of how the four adapters interact at inference time, our supplementary visualization traces one pass through the pipeline on a real trajectory: a frame is scored and ROI-localised by the Step Processor, tail-step entries are collapsed by the WM Compressor, a completed trajectory is committed by the Episodic Writer, and retrieved candidates are re-ranked by the Episodic Selector before the merged context is handed to the action backbone. The full per-stage prompts, multi-image inputs, and raw JSON outputs — for every stage and the backbone — are provided there.
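The control flow of such a pass can be sketched with stub operators. Everything below is a simplified assumption made for illustration: the relevance threshold, the memory budget, the choice to compress older entries while keeping the newest one, and the data layout are not taken from the paper, and the Episodic Selector stage is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MemoryEntry:
    summary: str
    roi: Optional[Tuple[int, int, int, int]] = None  # hypothetical (x1, y1, x2, y2) crop


@dataclass
class WorkingMemory:
    entries: List[MemoryEntry] = field(default_factory=list)
    budget: int = 2  # hypothetical cap on stored entries


def step_processor(frame: dict) -> Tuple[float, MemoryEntry]:
    """Stub: return a relevance score and a summarised, ROI-localised entry."""
    return frame["score"], MemoryEntry(frame["note"], frame.get("roi"))


def wm_compressor(entries: List[MemoryEntry]) -> List[MemoryEntry]:
    """Stub: collapse several entries into one text-only entry."""
    return [MemoryEntry(" ; ".join(e.summary for e in entries))]


def episodic_writer(task: str, memory: WorkingMemory) -> dict:
    """Stub: commit a finished trajectory as one episodic record."""
    return {"task": task, "summary": " ; ".join(e.summary for e in memory.entries)}


def online_step(memory: WorkingMemory, frame: dict, threshold: float = 0.5) -> None:
    score, entry = step_processor(frame)
    if score >= threshold:  # only task-relevant frames enter memory
        memory.entries.append(entry)
    if len(memory.entries) > memory.budget:
        # Assumed policy: compress older entries, keep the newest one intact.
        memory.entries = wm_compressor(memory.entries[:-1]) + memory.entries[-1:]


frames = [
    {"score": 0.9, "note": "opened Settings", "roi": (10, 10, 200, 60)},
    {"score": 0.2, "note": "cursor moved"},
    {"score": 0.8, "note": "clicked Display"},
    {"score": 0.7, "note": "set scale to 125%", "roi": (300, 120, 420, 160)},
]
memory = WorkingMemory()
for frame in frames:
    online_step(memory, frame)
record = episodic_writer("change display scaling", memory)
```

After the four frames, the low-scoring frame has been dropped, the two older entries have been merged into one text summary, and the newest entry still carries its ROI, so the context handed to the action model stays short while keeping recent visual evidence.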
@article{zeng2026mementogui,
title = {MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents},
author = {Zeng, Ziyun and Hua, Hang and Zou, Bocheng and Cai, Mu and Feris, Rogerio and Luo, Jiebo},
journal = {arXiv preprint},
year = {2026}
}