From Scale to Speed: Adaptive Test-Time Scaling for Image Editing

Xiangyan Qu, Zhenlong Yuan, Jing Tang, Rui Chen, Datao Tang, Meng Yu, Lei Sun, Yancheng Bai, Xiangxiang Chu, Gaopeng Gou, Gang Xiong, Yujun Cai

TL;DR

This work proposes ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework for image editing. With comparable sampling budgets, it achieves a better performance-efficiency trade-off than existing Image-CoT methods.

Abstract

Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and instruction. This mismatch causes three challenges when applying Image-CoT to editing: inefficient resource allocation with fixed sampling budgets, unreliable early-stage verification using general MLLM scores, and redundant edited results from large-scale sampling. To address these challenges, we propose ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework that improves both editing efficiency and performance. It incorporates three key strategies: (1) difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification in early pruning that uses region localization and caption consistency to select promising candidates; and (3) depth-first opportunistic stopping, guided by an instance-specific verifier, that terminates when intent-aligned results are found. Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance with more than 2x speedup over Best-of-N.
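
The three strategies named in the abstract can be pictured as a single control loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the linear budget rule, the `keep_ratio` and `stop_threshold` parameters, and every function and class name are hypothetical, and the two verifiers are plain callables standing in for the paper's edit-specific and instance-specific verifiers.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Candidate:
    seed: int
    early_score: float                  # score of a cheap, partially denoised preview
    final_score: Optional[float] = None  # score after full denoising


def allocate_budget(difficulty: float, max_budget: int, min_budget: int = 1) -> int:
    """Hypothetical linear rule: harder edits (difficulty near 1) get more samples."""
    difficulty = min(max(difficulty, 0.0), 1.0)
    return max(min_budget, round(min_budget + difficulty * (max_budget - min_budget)))


def adaptive_edit_search(
    difficulty: float,
    max_budget: int,
    early_verifier: Callable[[int], float],
    full_verifier: Callable[[int], float],
    keep_ratio: float = 0.5,       # assumed fraction of candidates kept after pruning
    stop_threshold: float = 0.9,   # assumed score at which a result counts as intent-aligned
) -> Tuple[Candidate, int]:
    # (1) Difficulty-aware allocation: choose how many candidates to start.
    budget = allocate_budget(difficulty, max_budget)
    candidates = [Candidate(seed=s, early_score=early_verifier(s)) for s in range(budget)]

    # (2) Early pruning: keep only the most promising previews.
    candidates.sort(key=lambda c: c.early_score, reverse=True)
    n_keep = max(1, math.ceil(len(candidates) * keep_ratio))
    survivors = candidates[:n_keep]

    # (3) Depth-first opportunistic stopping: finish one candidate at a time
    #     and stop as soon as one clears the threshold.
    best: Optional[Candidate] = None
    completed = 0
    for cand in survivors:
        cand.final_score = full_verifier(cand.seed)
        completed += 1
        if best is None or cand.final_score > best.final_score:
            best = cand
        if cand.final_score >= stop_threshold:
            break
    return best, completed
```

In contrast, a Best-of-N baseline would fully denoise all N candidates before scoring any of them. In this sketch the savings come from starting fewer candidates on easy edits, finishing only the survivors of pruning, and skipping the remaining survivors once one passes the threshold.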

Paper Structure

This paper contains 36 sections, 8 equations, 23 figures, and 10 tables.

Figures (23)

  • Figure 1: Impact of Image-CoT on different generative tasks. (a) T2I generation is an open-ended task benefiting from large-scale sampling. (b) Image editing is a goal-directed task where outputs are constrained by the prompt and source image, leading to redundant correct outputs after large-scale sampling.
  • Figure 2: Why existing Image-CoT methods are suboptimal for editing. (a) Inefficient resource allocation: Fixed budgets waste computation on simple edits (high initial scores, red box) that show minimal improvement. (b) Unreliable early-stage verification: 40% of samples with low early scores achieve high final scores, but are pruned incorrectly by general MLLM scores. (c) Redundant edited results: Large-scale sampling produces redundant correct outputs with identical best scores, though only one is sufficient.
  • Figure 3: Pipeline comparison of Image-CoT methods for editing. (a) Best-of-N employs a breadth-first search with a fixed sampling budget, producing all $N$ candidates before verification. (b) Previous pruning strategies improve efficiency by pruning with general MLLM scores. (c) Our ADE-CoT employs three strategies for improved quality and efficiency: difficulty-aware resource allocation to dynamically adjust the sampling budget (orange), edit-specific verification to identify promising candidates in the early denoising stage (blue), and depth-first opportunistic stopping to terminate the search when intent-aligned results are obtained (green).
  • Figure 4: Scaling curves on GEdit-Bench across different editing models. We show overall performance (G_O, y-axis) versus computational cost (NFE, x-axis) for sampling budgets of $N = 1, 2, 4, 8, 16, 32$. The shaded regions indicate error bars. Our ADE-CoT (purple star) consistently surpasses SOTA Image-CoT methods across all models and budgets, achieving a better performance-efficiency trade-off.
  • Figure 5: Effect of $\gamma$ in difficulty-aware resource allocation.
  • ...and 18 more figures