Instruction-guided Image Editing • Open-Source Agent

MIRA: Multimodal Iterative Reasoning Agent for Image Editing

Ziyun Zeng, Hang Hua, Jiebo Luo · University of Rochester · MIT-IBM Watson AI Lab

MIRA reframes image editing as an iterative perception–reasoning–action loop. It decomposes complex natural-language instructions into atomic edits and orchestrates open-source diffusion models step by step, improving semantic consistency and perceptual quality over both open-source and proprietary baselines.


Abstract

Instruction-guided image editing allows users to modify images with natural-language prompts, but existing diffusion-based editors often struggle when instructions are long, compositional, or context-dependent: they tend to drift semantically or miss fine-grained constraints.

MIRA is a lightweight multimodal agent that performs editing via an iterative perception–reasoning–action loop rather than a single-shot prompt. At each step, the model observes the original image, the current edited image, and the complex instruction, then predicts the next atomic edit or a stop signal. An external editor executes the atomic instruction, and the updated image is fed back for the next decision.

Paired with open-source editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly boosts semantic consistency and perceptual quality, reaching or surpassing proprietary systems like GPT-Image and Nano-Banana on challenging benchmarks.

Highlights

Key contributions of MIRA and its training ecosystem.

Model

  • Lightweight, plug-and-play agent built on Qwen2.5-VL, operating purely through atomic edit instructions.
  • Closed-loop reasoning: each decision is conditioned on the latest visual feedback, reducing error accumulation from earlier steps.
  • Modular tool use: supports interchangeable editing backbones without requiring large toolchains or complex orchestration (see the interface sketch below).
Vision–Language Agent · Instruction Following · Tool Use
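
To make the plug-and-play design concrete, the sketch below shows one way interchangeable editing backbones could sit behind a single narrow interface. It is a minimal sketch under assumed names (EditingBackbone, Flux1KontextEditor, run_step); the released code may expose a different API.

# Sketch: a narrow editor interface so backbones stay swappable.
# Names and signatures are illustrative assumptions, not the official MIRA API.
from typing import Protocol
from PIL import Image


class EditingBackbone(Protocol):
    """Anything that applies one atomic edit instruction to an image."""
    def edit(self, image: Image.Image, instruction: str) -> Image.Image: ...


class Flux1KontextEditor:
    """Hypothetical wrapper around a Flux.1-Kontext pipeline loaded elsewhere."""
    def __init__(self, pipeline):
        self.pipeline = pipeline  # assumed callable with (image, prompt)

    def edit(self, image: Image.Image, instruction: str) -> Image.Image:
        # Delegate the atomic instruction to the underlying diffusion editor;
        # adapt this call to the actual pipeline's signature.
        return self.pipeline(image=image, prompt=instruction)


def run_step(editor: EditingBackbone, image: Image.Image, instruction: str) -> Image.Image:
    # The agent only depends on this narrow interface, so wrappers for
    # Flux.1-Kontext, Step1X-Edit, or Qwen-Image-Edit can be swapped freely.
    return editor.edit(image, instruction)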

Training & Data

  • MIRA-EDITING: 150K multimodal trajectories with complex instructions and multi-step editing sequences.
  • Two-stage training: supervised fine-tuning followed by GRPO-based reinforcement learning with a composite reward.
  • Reward design: open-source, EditScore-based semantic and perceptual scores guide policy improvement (see the reward sketch below).
SFT · GRPO · EditScore
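
As a rough illustration of the composite reward, the sketch below blends a semantic score and a perceptual score into one scalar and normalizes rewards within a group, in the GRPO style. The scoring functions and the 0.7/0.3 weights are illustrative assumptions, not the paper's exact reward.

# Sketch: composite reward for GRPO-style policy updates.
# semantic_score / perceptual_score and the weights are illustrative stand-ins.

def composite_reward(original, edited, instruction,
                     semantic_score, perceptual_score,
                     w_sem=0.7, w_perc=0.3):
    """Blend semantic consistency and perceptual quality into one scalar."""
    # semantic_score: does the edited image satisfy the instruction,
    # given the original image? (e.g., an EditScore-style judge)
    s = semantic_score(original, edited, instruction)
    # perceptual_score: is the edited image visually clean? (no-reference IQA)
    p = perceptual_score(edited)
    return w_sem * s + w_perc * p


def group_relative_advantages(rewards, eps=1e-8):
    # GRPO normalizes each sampled trajectory's reward within its group
    # before using it as the advantage for the policy-gradient update.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]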

Framework

MIRA as an iterative vision–language agent for instruction-guided image editing.

Given an original image I₀ and a complex instruction C, MIRA maintains an intermediate edited image Iₜ₋₁. At iteration t:

  • The policy π receives (I₀, Iₜ₋₁, C) and outputs an atomic edit instruction uₜ.
  • An external image editor E (e.g., Flux.1-Kontext) executes uₜ, producing a new image Iₜ.
  • A termination controller decides whether to continue or emit <STOP>.

This loop continues until the global goal encoded by C is satisfied. Unlike pre-planned, open-loop pipelines, MIRA can detect mis-executed edits and issue corrective instructions, making it naturally robust to backend model errors.

# Pseudocode example of the MIRA editing loop.
# mira_policy stands in for the trained Qwen2.5-VL agent; editing_model stands in
# for any supported backbone (Flux.1-Kontext, Step1X-Edit, Qwen-Image-Edit).

I0 = load_image("input.png")
C  = "Change the white stove to black, make the floor wooden, and turn the cabinets brown."

I = I0
MAX_STEPS = 8                          # illustrative cap on atomic edits
for _ in range(MAX_STEPS):
    u = mira_policy(I0, I, C)          # next atomic instruction, or <STOP>
    if u == "<STOP>":                  # termination controller ends the loop
        break
    I = editing_model(I, u)            # execute the atomic edit on the current image

save(I, "output.png")

MIRA-EDITING Dataset

A 150K-sample dataset tailored for multi-step, instruction-based image editing.

  • Instruction aggregation: multi-turn editing sessions (e.g., from SeedEdit) are merged into single complex instructions, with permutations to enrich compositional structures.
  • Two-level rewriting: both atomic edits and aggregated instructions are paraphrased for linguistic diversity.
  • Candidate generation & ranking: strong open-source editors generate multiple trajectories; VIEScore-based models rank them, and the best is kept as supervision.

Each dataset entry includes the input image, a high-quality edited trajectory, the complex instruction, and its variants, all decomposed into start, continue, and stop samples for training the iterative policy; a sketch of this decomposition is shown below.
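
The sketch below shows how one supervised trajectory might be decomposed into start, continue, and stop samples; the field names are illustrative assumptions rather than the released MIRA-EDITING schema.

# Sketch: turning one trajectory into start/continue/stop training samples.
# Field names are illustrative, not the released dataset format.

def decompose_trajectory(input_image, complex_instruction, atomic_edits, images):
    """
    atomic_edits: [u_1, ..., u_T] atomic instructions in execution order
    images:       [I_0, I_1, ..., I_T] with I_0 == input_image
    Returns per-step samples: given (I_0, I_{t-1}, C), predict u_t,
    plus a final sample whose target is the stop token.
    """
    samples = []
    for t, u in enumerate(atomic_edits, start=1):
        samples.append({
            "original": input_image,        # I_0
            "current": images[t - 1],       # I_{t-1}
            "instruction": complex_instruction,
            "target": u,                    # next atomic edit
            "kind": "start" if t == 1 else "continue",
        })
    samples.append({
        "original": input_image,
        "current": images[-1],              # fully edited image I_T
        "instruction": complex_instruction,
        "target": "<STOP>",
        "kind": "stop",
    })
    return samples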

Results

MIRA consistently improves semantic consistency and perceptual quality when plugged into open-source editors.

  • Semantic consistency: up to +15% gains on GPT-SC / Gemini-SC / Qwen-SC over the base models.
  • Perceptual quality: cleaner edits, with ARNIQA and TOPIQ improvements despite multi-step editing.
  • Backbones: Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit all benefit from the same MIRA policy without retraining the editors.

On a 500-sample benchmark constructed from MagicBrush and CompBench subsets, MIRA-enhanced models narrow the gap to GPT-Image and Nano-Banana, while remaining fully open-source and modular.

Qualitative comparisons in the paper further show that MIRA can:

  • Follow long, multi-object instructions faithfully.
  • Gradually refine layouts and appearances with visible intermediate states.
  • Issue corrective edits when backend models make mistakes (e.g., recoloring a fridge back to white after an unintended change).

BibTeX

@article{zeng2025mira,
  title   = {MIRA: Multimodal Iterative Reasoning Agent for Image Editing},
  author  = {Zeng, Ziyun and Hua, Hang and Luo, Jiebo},
  journal = {arXiv preprint arXiv:YYYY.YYYYY},
  year    = {2025}
}

Contact

For questions, implementation details, or collaboration, please reach out to the authors via their institutional email addresses or open an issue on the project repository.