RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

Qiucheng Wu, Jing Shi, Simon Jenni, Kushal Kafle, Tianyu Wang, Shiyu Chang, Handong Zhao

TL;DR

This work introduces RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model, and demonstrates the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.

Abstract

Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.

Paper Structure

This paper contains 22 sections, 4 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Overview of RetouchIQ. Left: We annotate the user instruction and reasoning for the training data. Generated data are filtered to ensure quality. Middle: The supervised fine-tuning stage. Based on the user instruction (e.g., pop more), the policy model must reason about which parameters to change (e.g., exposure) and adjust them accordingly. Right: The reinforcement learning stage. We leverage a generalist reward model to propose metrics and provide scalar reward guidance for the policy model. Details of the reward model are given in Sec. \ref{sec:GRM}.
  • Figure 2: Problematic rewards in image retouching tasks. Given a before–after image pair (left), ❶ verifiable rewards (middle) rely on metrics between the edited image and ground truth, such as pixel differences. However, since multiple valid edits can satisfy user intent, these rewards become imprecise. ❷ The reward model's precision strongly depends on its training data distribution (right). When trained to distinguish good user edits from randomly perturbed images, it may later struggle to assess results from the policy model that produces combined, complex edits.
  • Figure 3: Overview of the generalist reward model. Left: Given a before-edit image and a user-edited after image After-u ("u" for "user"), we perturb the editing process to obtain a suboptimal image, After-w ("w" for "weak"). Middle: In the SFT stage, the reward model first learns to generate metrics and then produces scalar rewards for the two input images. Right: In the RL stage, we introduce policy-guided reward training (PGRT), where the suboptimal image After-w is provided by the policy model rather than generated through perturbation. The goal of the RL stage is to assign a higher scalar reward to the user-edited (strong) image After-u.
  • Figure 4: Comparison of reward model and policy model performance under different reward model configurations. The lines show the accuracies of the reward model, while the bars indicate the scores of the corresponding policy model.
  • Figure 5: Qualitative results across diverse image retouching scenarios, including quality enhancement (top), style transformation (middle), and local retouching (bottom). For each example, the input image and the corresponding user-edited result are shown on the left.
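
The contrast drawn in the captions for Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the mean-absolute-difference formula standing in for "pixel differences", and the pairwise accuracy definition are all assumptions made for illustration. The first function scores an edit only by its closeness to one fixed reference image. The second measures how often a reward model ranks the user-edited image After-u above the weaker image After-w.

```python
from typing import List, Sequence, Tuple

# Hypothetical toy representation: a grayscale image with values in [0, 1].
Image = List[List[float]]


def pixel_reward(edited: Image, reference: Image) -> float:
    """Rule-based reward (assumed form): negative mean absolute pixel
    difference between the edited image and one fixed reference image."""
    total = 0.0
    count = 0
    for row_e, row_r in zip(edited, reference):
        for e, r in zip(row_e, row_r):
            total += abs(e - r)
            count += 1
    return -total / count


def pairwise_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (reward for After-u, reward for After-w) pairs in which
    the user-edited image receives the strictly higher scalar reward."""
    correct = sum(1 for r_u, r_w in pairs if r_u > r_w)
    return correct / len(pairs)


# Toy data: a brighter edit may satisfy the instruction equally well,
# yet it is penalized purely for differing from the single reference.
reference = [[0.5, 0.5], [0.5, 0.5]]
brighter = [[0.75, 0.75], [0.75, 0.75]]

print(pixel_reward(reference, reference))
print(pixel_reward(brighter, reference))

# Hypothetical scalar rewards produced by a reward model for four pairs.
scores = [(0.9, 0.4), (0.3, 0.6), (0.8, 0.8), (0.7, 0.2)]
print(pairwise_accuracy(scores))
```

In this toy setup the brighter edit scores lower than the reference copy even though nothing says it fails the instruction. That is the imprecision Figure 2 attributes to verifiable rewards. The pairwise accuracy is one plausible reading of the reward model accuracy plotted in Figure 4.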