Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling

Ruijie Ye, Jiayi Zhang, Zhuoxin Liu, Zihao Zhu, Siyuan Yang, Li Li, Tianfu Fu, Franck Dernoncourt, Yue Zhao, Jiacheng Zhu, Ryan Rossi, Wenhao Chai, Zhengzhong Tu

TL;DR

Agent Banana narrows the gap between research image editors and professional workflows by supporting high-fidelity, multi-turn editing directly on native 4K images. It uses a hierarchical planner-executor architecture with two mechanisms: Context Folding, which compresses long interaction histories into structured memory to keep long-horizon control stable, and Image Layer Decomposition, which applies localized, layer-based edits so that non-target regions are preserved. The accompanying HDD-Bench benchmark provides verifiable, stepwise targets on native 4K images; on it, Agent Banana achieves the best multi-turn consistency and background fidelity while remaining competitive on instruction following. The authors present the system as a step toward reliable, professional-grade agentic image editing in real workflows, with HDD-Bench serving to diagnose long-horizon failures.

Abstract

We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user's intent; (ii) existing models are largely single-turn, while multi-turn edits can alter object faithfulness; and (iii) evaluation at around 1K resolution is misaligned with real workflows that often operate on ultra high-definition images (e.g., 4K). We propose Agent Banana, a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Agent Banana introduces two key mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. To support rigorous evaluation, we build HDD-Bench, a high-definition, dialogue-based benchmark featuring verifiable stepwise targets and native 4K images (11.8M pixels) for diagnosing long-horizon failures. On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (e.g., IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following, and also attains strong performance on standard single-turn editing benchmarks. We hope this work advances reliable, professional-grade agentic image editing and its integration into real workflows.
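
The idea behind localized, layer-based editing can be illustrated with a minimal sketch. It is not the authors' implementation: the names `edit_layer` and `brighten`, the box-shaped region, and the grayscale list-of-rows image are all assumptions made for illustration. The sketch only shows why editing a crop and pasting it back leaves non-target pixels identical and keeps the original resolution.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def edit_layer(image: Image, box: Box, edit_fn: Callable[[Image], Image]) -> Image:
    """Apply edit_fn only to the crop inside box and paste the result back.

    Pixels outside the box are copied through untouched, so the output keeps
    the input's full resolution and its non-target regions.
    """
    top, left, bottom, right = box
    crop = [row[left:right] for row in image[top:bottom]]
    edited = edit_fn(crop)
    if len(edited) != bottom - top or any(len(r) != right - left for r in edited):
        raise ValueError("edit_fn must return a crop of the same size")
    out = [row[:] for row in image]  # copy so the input is not mutated
    for dy, row in enumerate(edited):
        out[top + dy][left:right] = row
    return out


def brighten(crop: Image) -> Image:
    # Hypothetical stand-in for a real editing model.
    return [[min(255, p + 50) for p in row] for row in crop]


image = [[10 * (x + y) for x in range(6)] for y in range(4)]
result = edit_layer(image, (1, 2, 3, 5), brighten)
```

A real system would presumably use soft masks and blending at layer boundaries rather than a hard rectangular paste; the hard paste is used here only because it makes the preservation property easy to check.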

Paper Structure (27 sections, 4 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: We present Agent Banana, an agentic editing system that enables high-fidelity, native-resolution image editing through reasoning-based natural-language interaction, where each edit is context-aware, logically dependent, and locally precise. In this example, the user provides a vague yet complex editing prompt, and Agent Banana iteratively refines a scene in native high resolution ($5460\times 3640$): from a stylistic replacement (Turn 1), to attribute decoupling that preserves non-target dynamics (changing the bottle color without affecting the pouring liquid; Turn 2), and finally to retrieving a prior state and adding fine details (Turn 3). The result is a professional-style workflow that resists over-editing and background drift, while faithfully preserving what should remain unchanged.
  • Figure 2: Overview of the Agent Banana Framework. The system operates in a multi-turn loop (Left), comprising two core agents: a Planner that decomposes user queries into executable editing plans, and an Executor that selects tools via the MCP Server. Crucially, the Executor incorporates a self-correction mechanism (Quality Test), reiterating the editing process if the quality check fails before presenting the result to the user. (Right) Our Evaluator assesses performance by analyzing the transition between Turn $n-1$ and Turn $n$, utilizing instruction adherence checks and state tracking (JSON) to derive the final score.
  • Figure 3: Scalable Data Pipeline for Multi-turn Editing. This diagram illustrates the process of generating aligned (State, Instruction) pairs from HD images.
  • Figure 4: Qualitative Comparison of Editing Fidelity. We utilize the instruction "...And change that little bright blue cooler under the shelter to a softer sea‑foam green with a creamy top ..." to guide the editing process. While the prompt solely targets color modification, baseline models exhibit significant limitations: they often suffer from reduced resolution, introduce unwanted structural changes (modifying shape or position), or fail to apply the target color change. By leveraging our agent's superior interpretation capabilities, our method accurately captures the instruction's focus while preserving the integrity of the original image.
  • Figure 5: Qualitative Comparison of Unedited Region Consistency. Although the editing instruction does not target the sofa cushion, Nano Banana Pro distorts the original details due to global editing. In contrast, our method successfully maintains the visual consistency of the unedited regions.
  • ...and 1 more figure
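
The planner-executor loop with a quality test, as described in the Figure 2 caption, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: `plan`, `execute_step`, `run_turn`, the semicolon-based step splitting, and the retry limit are all hypothetical, and tool selection through the MCP Server is omitted entirely. The image state is modeled as a plain list of applied steps.

```python
from typing import Callable, List

State = List[str]  # toy stand-in for the image state: the steps applied so far


def plan(query: str) -> List[str]:
    """Toy planner: split a compound request into ordered editing steps."""
    return [part.strip() for part in query.split(";") if part.strip()]


def execute_step(
    step: str,
    state: State,
    apply_edit: Callable[[State, str, int], State],
    quality_ok: Callable[[State, str], bool],
    max_attempts: int = 3,
) -> State:
    """Retry one step until the quality check passes or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        candidate = apply_edit(state, step, attempt)
        if quality_ok(candidate, step):
            return candidate
    raise RuntimeError(f"step {step!r} failed the quality check {max_attempts} times")


def run_turn(query, state, apply_edit, quality_ok) -> State:
    """Plan the user's query, then execute each step in order."""
    for step in plan(query):
        state = execute_step(step, state, apply_edit, quality_ok)
    return state


attempts_log = []


def flaky_edit(state: State, step: str, attempt: int) -> State:
    # Hypothetical editor that only succeeds from the second attempt onward.
    attempts_log.append((step, attempt))
    return state + [step] if attempt >= 2 else state


def step_applied(candidate: State, step: str) -> bool:
    # Hypothetical quality test: the latest applied step must be the requested one.
    return bool(candidate) and candidate[-1] == step


final = run_turn("recolor bottle; add label", [], flaky_edit, step_applied)
```

Here each step fails its first quality check and passes on the retry, so the result is only handed back once every planned step has been verified.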