JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization

Yunlong Lin, Linqing Wang, Kunjie Lin, Zixu Lin, Kaixiong Gong, Wenbo Li, Bin Lin, Zhenxi Li, Shiyi Zhang, Yuyang Peng, Wenxun Dai, Xinghao Ding, Chunyu Wang, Qinglin Lu

TL;DR

JarvisEvo addresses instruction hallucination and reward hacking in image-editing agents by coupling interleaved multimodal reasoning (iMCoT) with a co-evolutionary editor–evaluator optimization (SEPO) and on-policy reflection. It unifies editing and evaluation within a single model and leverages Adobe Lightroom for global and local refinements, guided by a two-loop training regime and reflective fine-tuning. SEPO combines intrinsic self-rewards for editing with verifiable, human-annotated evaluation to stabilize learning and reduce self-deception and reward hacking. On ArtEdit-Bench, JarvisEvo outperforms strong baselines in preservative editing and aligns evaluation with human judgments more closely than competing models.

Abstract

Agent-based editing models have substantially advanced interactive experiences, processing quality, and creative flexibility. However, two critical challenges persist: (1) instruction hallucination: text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors due to inherent information bottlenecks; and (2) reward hacking: dynamic policy optimization against static reward models allows agents to exploit flaws in reward functions. To address these issues, we propose JarvisEvo, a unified image editing agent that emulates an expert human designer by iteratively editing, selecting appropriate tools, evaluating results, and reflecting on its own decisions to refine outcomes. JarvisEvo offers three key advantages: (1) an interleaved multimodal chain-of-thought (iMCoT) reasoning mechanism that enhances instruction following and editing quality; (2) a synergistic editor-evaluator policy optimization (SEPO) framework that enables self-improvement without external rewards, effectively mitigating reward hacking; and (3) support for both global and local fine-grained editing through seamless integration of Adobe Lightroom. On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a substantial 44.96% improvement in pixel-level content fidelity. Project page: https://jarvisevo.vercel.app/

Paper Structure

This paper contains 20 sections, 3 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: JarvisEvo performs interleaved multimodal Chain-of-Thought (iMCoT) reasoning for image editing, which marries multi-step planning, dynamic tool orchestration, and iterative visual feedback. This closed-loop workflow incorporates self-evaluation and refinement to ensure the final output is both visually compelling and faithful to the creative vision.
  • Figure 2: Inference and training pipelines of JarvisEvo.
  • Figure 3: The Synergistic Editor–Evaluator Optimization (SEPO) framework consists of two iterative loops. Loop 1 optimizes the editor policy using self-evaluation scores, thereby improving iMCoT reasoning and tool use. In addition, an online reflection data generation pipeline autonomously constructs reflection samples, which are then used to further fine-tune the model’s reflective capabilities. Loop 2 refines the evaluator policy with human-labeled evaluation data to ensure reliable assessment and to mitigate self-deception or reward hacking during editor optimization.
  • Figure 4: Scenario and prompt distribution on ArtEdit.
  • Figure 5: Reflection data sample generated during on-policy updates in SEPO, including erroneous editing trajectories, corrective reflections, accurate editing operations, and corresponding images.
  • ...and 13 more figures
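
The two-loop structure described in the Figure 3 caption can be illustrated with a deliberately tiny toy model. This is a sketch under stated assumptions, not the paper's SEPO implementation: every name here (`ToyAgent`, `TARGET`, `human_score`, `editor_step`, `evaluator_step`, `train`) is hypothetical, the "editor" is a single scalar, greedy candidate selection stands in for policy optimization, and the "evaluator" is the true score plus one exploitable bias term. It only aims to show why an editor trained on its own scores can drift (reward hacking) and how periodically fitting the evaluator to human labels may counteract that.

```python
from dataclasses import dataclass

TARGET = 0.5  # hypothetical "ideal" edit strength preferred by human raters


def human_score(value: float) -> float:
    """Stand-in for a human-annotated quality label (higher is better)."""
    return -abs(value - TARGET)


@dataclass
class ToyAgent:
    """Assumed toy stand-in for a unified editor-evaluator model."""
    edit_strength: float = 0.0  # editor state (a single scalar)
    eval_bias: float = 2.0      # evaluator flaw: over-rewards stronger edits

    def self_score(self, value: float) -> float:
        """Evaluator output: the true score plus an exploitable bias term."""
        return human_score(value) + self.eval_bias * value

    def editor_step(self, step: float = 0.1) -> None:
        """Loop 1 (toy): keep the candidate the agent's own evaluator rates highest."""
        candidates = [self.edit_strength + d for d in (-step, 0.0, step)]
        self.edit_strength = max(candidates, key=self.self_score)

    def evaluator_step(self, labeled_values, lr: float = 1.0) -> None:
        """Loop 2 (toy): one gradient step on squared error against human labels."""
        grad = sum((self.self_score(v) - human_score(v)) * v for v in labeled_values)
        self.eval_bias -= lr * grad / len(labeled_values)


def train(rounds: int, update_evaluator: bool) -> ToyAgent:
    """Alternate the two loops; optionally freeze the evaluator for comparison."""
    agent = ToyAgent()
    for _ in range(rounds):
        agent.editor_step()
        if update_evaluator:
            agent.evaluator_step([0.0, 0.5, 1.0])
    return agent


if __name__ == "__main__":
    co_evolved = train(rounds=20, update_evaluator=True)
    frozen = train(rounds=20, update_evaluator=False)
    print("co-evolved edit strength:", co_evolved.edit_strength)
    print("frozen-evaluator edit strength:", frozen.edit_strength)
```

In this toy setting, the frozen-evaluator run keeps increasing the edit strength because the biased score always rewards it, while the co-evolved run settles near `TARGET` once the bias has been fitted away. The real system operates on images, tool calls, and learned policies, so the analogy holds only at the level of the alternating structure.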