Training-Free Text-Guided Image Editing with Visual Autoregressive Model

Yufei Wang, Lanqing Guo, Zhihao Li, Jiaxing Huang, Pichao Wang, Bihan Wen, Jian Wang

TL;DR

This work tackles training-free text-guided image editing by removing the dependency on inversion and its fidelity pitfalls. It introduces AREdit, a Visual AutoRegressive Modeling (VAR) framework backed by Infinity-2B, that uses randomness caching, adaptive fine-grained masking, and token re-assembly to realize precise, local edits while preserving non-edited regions. The method achieves high-fidelity edits with fast inference, demonstrated on the PIE-Bench dataset with performance comparable to or better than diffusion- and rectified-flow-based approaches. Ablation studies show how hyperparameters controlling reuse of low-frequency content and masking granularity influence fidelity and diversity, and attention control further improves large-area edits. Code will be released to enable broader adoption and benchmarking.

Abstract

Text-guided image editing is an essential task that enables users to modify images through natural language descriptions. Recent advances in diffusion models and rectified flows have significantly improved editing quality, primarily relying on inversion techniques to extract structured noise from input images. However, inaccuracies in inversion can propagate errors, leading to unintended modifications and compromising fidelity. Moreover, even with perfect inversion, the entanglement between textual prompts and image features often results in global changes when only local edits are intended. To address these challenges, we propose a novel text-guided image editing framework based on VAR (Visual AutoRegressive modeling), which eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our method introduces a caching mechanism that stores token indices and probability distributions from the original image, capturing the relationship between the source prompt and the image. Using this cache, we design an adaptive fine-grained masking strategy that dynamically identifies and constrains modifications to relevant regions, preventing unintended changes. A token reassembling approach further refines the editing process, enhancing diversity, fidelity, and control. Our framework operates in a training-free manner and achieves high-fidelity editing with fast inference, processing a 1K resolution image in as little as 1.2 seconds. Extensive experiments demonstrate that our method achieves performance comparable to, or even surpassing, existing diffusion- and rectified flow-based approaches in both quantitative metrics and visual quality. The code will be released.

Paper Structure

This paper contains 13 sections, 7 equations, 8 figures, 2 tables, 2 algorithms.

Figures (8)

  • Figure 1: AREdit for Text-Guided Image Editing. It can effectively handle a variety of editing tasks for both artificial and natural images, e.g., object removal (as in example a), object addition (b, c), attribute modification (b, d), and style alteration (c, f). Our method excels at preserving unrelated areas of the image while offering a flexible trade-off with generative capacity. Remarkably fast, our approach processes a 1080p input image in 2.5 seconds for the first run and $\sim$1.2 seconds for subsequent runs on an A100 GPU.
  • Figure 2: The overall framework of the proposed method, built on the pretrained Infinity model [han2024infinity]. Given an input image and its text prompt, we first cache the bit labels $\mathbf{R}_{queue}$ and probability distributions $\mathbf{P}_{queue}$ for editing using a feedforward evaluation, as in Algorithm \ref{alg:caching}. During editing, for steps where $k \leq \gamma$, the cached bit labels are reused to preserve the overall appearance and structure of the image. For steps where $k > \gamma$, we compute a fine-grained adaptive mask, $\mathbf{M}_k$, from the probability distributions $\mathbf{P}_k$ and $\mathbf{P}_k^{tgt}$. Here, $\mathbf{P}_k$ is taken from the cached distributions $\mathbf{P}_{queue}$, and $\mathbf{P}_k^{tgt}$ is predicted by conditioning on the target prompt, $t_T$. The fine-grained mask $\mathbf{M}_k \in \mathbb{R}^{(h_k \times w_k) \times d}$ is then used to blend the cached bit labels $\mathbf{R}_k$ with the labels $\mathbf{R}_k'$ randomly sampled from the distribution $\mathbf{P}_k^{tgt}$. Finally, the decoder generates the edited image from the content $(\mathbf{R}_1, \ldots, \mathbf{R}_\gamma, \mathbf{R}^{tgt}_{\gamma+1}, \ldots, \mathbf{R}^{tgt}_{K})$. For illustration purposes, we use $K=3$ and $\gamma=2$ as an example.
  • Figure 3: Visual comparisons of text-guided image editing results from AREdit (Ours), RFInversion [rout2025semantic], MasaCtrl [cao2023masactrl], Prompt2Prompt [hertz2022prompt], and LEdits++ [brack2024ledits]. The original image and the source/target prompts or editing instructions are provided. Owing to its design and the nature of visual autoregressive models, the proposed method better preserves detail in areas unrelated to the edit and follows instructions closely.
  • Figure 4: An ablation study showing the effects of our two hyperparameters, $\gamma$ and $\tau$, varied from $1$ to $3$ and from $0.4$ to $0.2$, respectively. The hyperparameter $\gamma$ primarily governs the preservation of low-frequency details, while $\tau$ controls the preservation of high-frequency details, given certain low-frequency information, i.e., a specific value of $\gamma$.
  • Figure 5: Comparison between the proposed adaptive fine-grained mask (d) and the spatial-wise mask (b, c, e, f) with different hyperparameters. The adaptive fine-grained mask, in $\mathbb{B}^{h_k \times w_k \times d}$, enables more precise control over the edited image and preserves more information than the spatial-wise mask, in $\mathbb{B}^{h_k \times w_k}$. Editing with the spatial-wise mask fails in cases such as style transfer.
  • ...and 3 more figures
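
The editing loop described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The thresholding rule `abs(p_src - p_tgt) > tau` is assumed, since the caption only says the mask is computed from the two distributions. All function names are hypothetical, each probability is treated as the chance that a bit label equals 1, and the target distributions are passed in precomputed rather than predicted scale by scale by the model.

```python
import random


def adaptive_mask(p_src, p_tgt, tau):
    """Per-bit mask for one scale (rows = tokens, columns = d bits).

    ASSUMPTION: a bit is marked editable when the target prompt shifts its
    probability by more than tau. The source only states that the mask is
    derived from P_k and P_k^tgt.
    """
    return [[abs(a - b) > tau for a, b in zip(row_s, row_t)]
            for row_s, row_t in zip(p_src, p_tgt)]


def sample_bits(p_tgt, rng):
    """Sample bit labels R_k' from the target distribution P_k^tgt."""
    return [[1 if rng.random() < q else 0 for q in row] for row in p_tgt]


def blend(cached_bits, sampled_bits, mask):
    """Keep the cached bit where the mask is False, take the sampled bit where True."""
    return [[s if m else c for c, s, m in zip(row_c, row_s, row_m)]
            for row_c, row_s, row_m in zip(cached_bits, sampled_bits, mask)]


def edit_scales(r_queue, p_queue, p_tgt_queue, gamma, tau, seed=0):
    """Reuse cached labels for scales k <= gamma, blend for scales k > gamma.

    SIMPLIFICATION: p_tgt_queue stands in for the model's target-prompt
    predictions, which in practice depend on the previously chosen scales.
    """
    rng = random.Random(seed)
    edited = []
    scales = zip(r_queue, p_queue, p_tgt_queue)
    for k, (r_k, p_k, p_tgt_k) in enumerate(scales, start=1):
        if k <= gamma:
            # Coarse scales: copy the cache to preserve overall structure.
            edited.append(r_k)
        else:
            # Fine scales: only bits flagged by the mask are resampled.
            m_k = adaptive_mask(p_k, p_tgt_k, tau)
            r_new = sample_bits(p_tgt_k, rng)
            edited.append(blend(r_k, r_new, m_k))
    return edited
```

With three scales and `gamma=2`, the first two scales come back unchanged and only the third is blended, which mirrors the $K=3$, $\gamma=2$ illustration in the caption. Lowering `tau` marks more bits as editable, so more of the cached fine-scale content is replaced.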