Preprint · 2026

MementoGUI: Learning Agentic Multimodal Memory Control
for Long-Horizon GUI Agents

Ziyun Zeng¹,* Hang Hua²,*,† Bocheng Zou³ Mu Cai³ Rogerio Feris² Jiebo Luo¹

¹University of Rochester · ²MIT-IBM Watson AI Lab · ³University of Wisconsin-Madison

† Project Lead · * Equal Contribution

MementoGUI augments a frozen GUI action backbone with multimodal working and episodic memory. MementoCore updates, retrieves, and writes memory, then serializes textual summaries and ROI references as multimodal context for GUI action prediction.

TL;DR

Long-horizon GUI control is bottlenecked not by single-step perception, but by active management of long-term multimodal state. MementoGUI reframes this as an online memory-control problem: a plug-in controller, MementoCore, decides when to update memory, what to preserve, how to compress, and when to retrieve — without finetuning the underlying GUI backbone.

Frozen GUI backbone, no action-model finetuning
4 LoRA adapters on a shared Qwen3-VL
Working & episodic memory, jointly controlled
ROI-level visual evidence, not just text summaries
Plug-in for open-source and proprietary MLLMs

Abstract

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.

MementoCore: Four Memory-Control Operators

MementoCore attaches four task-specific LoRA adapters to a shared frozen Qwen3-VL backbone. Each adapter plays one well-defined role in the online memory loop; together they turn raw interaction history into decision-oriented multimodal context.

LoRA 1

Step Processor

Scores each new frame for write-salience, produces an event summary and an ROI bounding box, and flags whether episodic retrieval should fire.

LoRA 2

WM Compressor

Consolidates older working-memory entries into compact summaries while preserving representative visual identifiers for later grounding.

LoRA 3

Episodic Writer

Converts a completed trajectory into a reusable episodic entry — summary, outcome metadata, retrieval embeddings, and representative ROI crops.

LoRA 4

Episodic Selector

Filters coarse vector-retrieved candidates by multimodal relevance to the current task state before they are injected into the backbone prompt.
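
The two online roles above (step processing and working-memory compression) can be read as a small control loop. The Python sketch below is illustrative only: the class names, the `write_threshold` value, the `max_entries` budget, and the stubbed `toy_compress` function are assumptions made for this example, not the released MementoCore interface. In the actual system each operator is a LoRA adapter on Qwen3-VL; here they are plain data classes and callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in screenshot pixels


@dataclass
class StepResult:
    """Hypothetical output of the Step Processor for one frame."""
    salience: float       # write-salience score, assumed to lie in [0, 1]
    summary: str          # short event summary
    roi: Optional[BBox]   # region of interest, if any
    retrieve: bool        # whether episodic retrieval should fire


@dataclass
class MemoryEntry:
    summary: str
    roi: Optional[BBox] = None


@dataclass
class WorkingMemory:
    entries: List[MemoryEntry] = field(default_factory=list)


def memory_step(
    wm: WorkingMemory,
    step: StepResult,
    compress: Callable[[List[MemoryEntry]], MemoryEntry],
    write_threshold: float = 0.5,  # assumed value
    max_entries: int = 4,          # assumed budget
) -> bool:
    """Update working memory for one frame and return the retrieval flag."""
    if step.salience >= write_threshold:
        wm.entries.append(MemoryEntry(step.summary, step.roi))
    if len(wm.entries) > max_entries:
        # Consolidate the oldest entries into one; keep the newest untouched.
        split = len(wm.entries) - (max_entries - 1)
        old, recent = wm.entries[:split], wm.entries[split:]
        wm.entries = [compress(old)] + recent
    return step.retrieve


def toy_compress(old: List[MemoryEntry]) -> MemoryEntry:
    """Stand-in for the WM Compressor: join summaries, keep the first ROI."""
    roi = next((e.roi for e in old if e.roi is not None), None)
    return MemoryEntry("; ".join(e.summary for e in old), roi)
```

Keeping one representative ROI in the compressed entry mirrors the idea of preserving visual identifiers for later grounding; how the real compressor chooses which identifiers to keep is not specified here.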

Data Curation Pipeline

We automatically convert raw computer-use videos into structured supervision for all four operators. The pipeline parses each trajectory into frame-level and subgoal-level annotations, assembles SFT data for step processing, memory compression, episodic writing, and episodic selection, and finally constructs DPO preference pairs for the two online operators via rule-based corruption and VLM-judged filtering. Of 200 human-validated trajectories, 197 (98.5%) are judged fully correct.
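
As a rough illustration of the rule-based corruption step, the sketch below builds one preference pair for the Step Processor by perturbing a reference output. The field names, the three corruption rules, and the `prompt` / `chosen` / `rejected` record layout are assumptions for this example; the actual rules are not specified here, and the VLM-judged filtering stage is omitted.

```python
import json
import random
from typing import Dict


def corrupt_step_output(ref: Dict, rng: random.Random) -> Dict:
    """Apply one hypothetical corruption rule to a reference output."""
    bad = dict(ref)
    rule = rng.choice(["flip_retrieve", "shift_roi", "drop_summary"])
    if rule == "flip_retrieve":
        # Wrong decision about whether episodic retrieval should fire.
        bad["retrieve"] = not ref["retrieve"]
    elif rule == "shift_roi":
        # Slide the box one full width to the right so it misses the target.
        x1, y1, x2, y2 = ref["roi"]
        width = x2 - x1
        bad["roi"] = [x1 + width, y1, x2 + width, y2]
    else:
        # Discard the event summary entirely.
        bad["summary"] = ""
    return bad


def make_dpo_pair(prompt: str, ref: Dict, seed: int = 0) -> Dict[str, str]:
    """Pair the reference output (chosen) with a corrupted copy (rejected)."""
    rng = random.Random(seed)
    rejected = corrupt_step_output(ref, rng)
    return {
        "prompt": prompt,
        "chosen": json.dumps(ref, sort_keys=True),
        "rejected": json.dumps(rejected, sort_keys=True),
    }
```

Each rule is chosen so that the rejected output always differs from the reference, provided the reference has a non-empty summary and an ROI of non-zero width.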

Figure: (A) Raw computer-use videos are parsed into hierarchical frame- and subgoal-level annotations. (B) These annotations are converted into SFT data for four MementoCore operators. (C) Step-processing and memory-compression samples are further corrupted and VLM-filtered to form DPO training pairs.

Main Results

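We evaluate four frozen open-source GUI backbones and two proprietary generalist MLLM agents on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench. Across all four open-source backbones, the full MementoGUI configuration (working plus episodic memory) outperforms the no-history, raw-history, and text-only memory baselines on GUI-Odyssey AMS and trajectory success and on MementoGUI-Bench TPS and MCS. The two exceptions in the table are VAM on UI-Venus-1.5-8B and MM-Mind2Web Step SR on GUI-Owl-1.5-8B, where the no-history baseline scores highest.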

Backbone	Method	GUI-Odyssey AMS ↑	GUI-Odyssey Traj. SR ↑	MM-Mind2Web Step SR ↑	MementoGUI-Bench VAM ↑	MementoGUI-Bench TPS ↑	MementoGUI-Bench MCS ↑
Proprietary generalist MLLM agents
GPT-5.5	Direct Prompting	54.46	2.02	18.81	1.95	2.00	2.86
Gemini-3.1-Pro	Direct Prompting	60.62	1.81	22.97	2.18	2.67	2.75
Open-source frozen GUI backbones
UI-Venus-1.5-8B	No History	54.58	1.29	5.29	1.80	2.00	0.00
	Pred. Hist. All	66.31	2.33	7.57	1.58	2.29	5.36
	Text Summary Memory	62.18	2.12	11.66	1.58	3.38	3.08
	Working Memory	67.69	2.69	11.80	1.41	4.63	7.00
	+ Episodic Memory	68.32	3.57	12.60	1.67	5.16	7.14
MAI-UI-8B	No History	35.70	0.36	12.53	1.81	2.00	0.00
	Pred. Hist. All	44.79	1.35	13.44	1.54	2.11	7.55
	Text Summary Memory	38.33	0.62	13.28	1.63	3.00	3.23
	Working Memory	49.08	1.97	14.61	1.76	5.13	8.18
	+ Episodic Memory	49.31	2.12	14.67	1.88	5.36	8.13
GUI-Owl-1.5-8B	No History	40.15	0.16	17.78	1.60	2.00	0.00
	Pred. Hist. All	38.88	0.47	13.56	1.04	2.00	5.25
	Text Summary Memory	45.45	0.62	14.14	1.51	3.05	3.10
	Working Memory	48.25	1.40	15.53	1.79	3.38	7.39
	+ Episodic Memory	49.45	1.71	16.42	1.82	4.18	7.73
GUI-Owl-1.5-32B	No History	45.73	0.57	18.98	2.10	2.00	0.00
	Pred. Hist. All	49.02	1.55	16.48	1.83	2.11	4.73
	Text Summary Memory	47.21	0.72	17.00	2.37	3.40	3.40
	Working Memory	51.36	2.33	17.71	2.50	4.24	7.85
	+ Episodic Memory	55.17	2.59	19.12	2.89	4.31	8.30

Metrics: AMS and trajectory success (%) on GUI-Odyssey, Step SR (%) on MM-Mind2Web, and VAM / TPS / MCS on MementoGUI-Bench (Gemini-3.1-Pro as judge).
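
The gap between step-level scores and trajectory success in the table is easier to read with the aggregation written out. The sketch below assumes the common convention that a step-level score is the percentage of steps whose predicted action matches the reference, and that a trajectory counts as successful only if every one of its steps matches. Both definitions, and the pooling over all steps, are assumptions for illustration; the exact matching rules are defined by each benchmark.

```python
from typing import List


def step_accuracy(trajs: List[List[bool]]) -> float:
    """Percentage of matched steps, pooled over all trajectories."""
    total = sum(len(t) for t in trajs)
    matched = sum(sum(t) for t in trajs)
    return 100.0 * matched / total


def trajectory_success(trajs: List[List[bool]]) -> float:
    """Percentage of trajectories in which every step is matched."""
    return 100.0 * sum(all(t) for t in trajs) / len(trajs)


# Toy example: three trajectories with per-step match flags.
toy = [
    [True, True],
    [True, False, True, True],
    [False, True],
]
print(step_accuracy(toy))       # 6 of 8 steps matched
print(trajectory_success(toy))  # 1 of 3 trajectories fully matched
```

Under this convention a single wrong step fails the whole trajectory, which is why trajectory success can stay in the low single digits on long tasks even when most individual steps are correct.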

GUI-Odyssey performance by trajectory length

Figure: GUI-Odyssey performance on UI-Venus-1.5-8B, broken out by trajectory-length bins. Gains from working + episodic memory grow with trajectory length, where raw history and text-only memory struggle the most.

Figure: Effect of episodic-memory bank size on GUI-Odyssey across frozen backbones. Larger banks generally improve trajectory success, indicating that reusable experience mainly benefits long-horizon completion.

Ablations

Visual grounding matters in working memory

Removing ROI reference images from the memory context (keeping the same learned memory controller) consistently hurts AMS, VAM, and MCS on both UI-Venus-1.5-8B and GUI-Owl-1.5-8B. Localized visual evidence is not redundant with textual summaries.
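
To make this ablation concrete, the sketch below serializes working-memory entries into an interleaved text-and-image context, with a switch that drops the ROI references while leaving the summaries unchanged. The message layout (a list of `{"type": ...}` parts), the field names, and the `include_roi` flag are assumptions for illustration, not the serialization format used by MementoGUI.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MemoryEntry:
    step: int
    summary: str
    roi_path: Optional[str] = None  # path to a cropped ROI image, if kept


def serialize_memory(
    entries: List[MemoryEntry], include_roi: bool = True
) -> List[Dict[str, str]]:
    """Turn memory entries into an ordered list of prompt parts."""
    parts: List[Dict[str, str]] = []
    for entry in entries:
        parts.append(
            {"type": "text", "text": f"[step {entry.step}] {entry.summary}"}
        )
        # The ablation corresponds to include_roi=False: same text, no crops.
        if include_roi and entry.roi_path is not None:
            parts.append({"type": "image", "path": entry.roi_path})
    return parts
```

With `include_roi=False` the backbone still sees every textual summary, so any drop in score under this setting can be attributed to the missing visual evidence rather than to a different memory policy.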

Learned episodic selection beats raw retrieval

Random episodic context and single-stage embedding retrieval both trail the full two-stage (retrieve + learned-select) pipeline. MementoGUI benefits from both ROI-level grounding and filtered episodic experience, rather than just adding more context.
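
The two-stage pipeline compared here can be sketched as coarse embedding retrieval followed by a relevance filter. In the code below, cosine similarity over toy vectors stands in for the vector retriever and a plain callable `is_relevant` stands in for the learned Episodic Selector; both stand-ins, along with the function names and the `top_k` default, are assumptions for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_select(
    query: Vector,
    bank: List[Tuple[str, Vector]],
    is_relevant: Callable[[str], bool],
    top_k: int = 3,
) -> List[str]:
    """Stage 1: top-k by embedding similarity. Stage 2: keep relevant ones."""
    ranked = sorted(bank, key=lambda item: cosine(query, item[1]), reverse=True)
    candidates = [entry for entry, _ in ranked[:top_k]]
    return [entry for entry in candidates if is_relevant(entry)]
```

Single-stage retrieval corresponds to passing `is_relevant=lambda _: True`, which injects every top-k candidate regardless of whether it actually helps with the current task state.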

Scaling MementoCore

With the memory architecture and frozen GUI backbone fixed, we scale the memory controller from 2B → 4B → 8B. Larger controllers generally improve long-horizon decision support, with the 8B variant reaching the strongest AMS for both UI-Venus-1.5-8B and GUI-Owl-1.5-8B on GUI-Odyssey, and the strongest VAM for GUI-Owl-1.5-8B on MementoGUI-Bench. MementoGUI therefore scales as a plug-in memory layer — one can replace the controller with a stronger variant without touching the action model.

MementoGUI-Bench

MementoGUI-Bench is an offline benchmark for memory-dependent long-horizon GUI decision making, derived from PSAI computer-use videos. It contains 200 trajectories and 6,953 steps (average 34.8 steps per trajectory). 80 trajectories are used for testing and 120 for accumulating episodic memory. The benchmark focuses on cases where the next action depends on accumulated state: delayed constraints, completed subgoals, or prior experience.

Alongside standard reference-based scores, we report three VLM-judged, memory-aware metrics:

VAM (VLM-based Action Match): is the predicted action semantically equivalent to the reference action on the current screenshot?

TPS (Task Progress Score): does the predicted sequence move the task forward, without loops, regressions, or stalling?

MCS (Memory Consistency Score): does the memory state evolve consistently with task progress, user constraints, and retrieved episodic experience?

MementoCore Working Visualization

To give a concrete sense of how the four adapters interact at inference time, our supplementary visualization traces one pass through the pipeline on a real trajectory: a frame is scored and ROI-localized by the Step Processor, tail-step entries are collapsed by the WM Compressor, a completed trajectory is committed by the Episodic Writer, and retrieved candidates are filtered by the Episodic Selector before the merged context is handed to the action backbone. The full per-stage prompts, multi-image inputs, and raw JSON outputs for every stage and for the backbone are provided there.

BibTeX

@article{zeng2026mementogui,
  title   = {MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents},
  author  = {Zeng, Ziyun and Hua, Hang and Zou, Bocheng and Cai, Mu and Feris, Rogerio and Luo, Jiebo},
  journal = {arXiv preprint},
  year    = {2026}
}