Training-Free Text-Guided Image Editing with Visual Autoregressive Model

Yufei Wang; Lanqing Guo; Zhihao Li; Jiaxing Huang; Pichao Wang; Bihan Wen; Jian Wang

TL;DR

This work tackles training-free text-guided image editing by removing the dependency on inversion and the fidelity loss it can introduce. It introduces AREdit, a Visual AutoRegressive modeling (VAR) framework built on Infinity-2B, that uses randomness caching, adaptive fine-grained masking, and token reassembling to make precise, local edits while preserving non-edited regions. The method achieves high-fidelity edits with fast inference, demonstrated on the PIE-Bench dataset with performance comparable to or better than diffusion- and rectified-flow-based approaches. Ablation studies show how the hyperparameters that control reuse of low-frequency content and masking granularity influence fidelity and diversity, and that attention control further improves large-area edits. Code will be released to enable broader adoption and benchmarking.

Abstract

Text-guided image editing is an essential task that enables users to modify images through natural language descriptions. Recent advances in diffusion models and rectified flows have significantly improved editing quality, primarily by relying on inversion techniques to extract structured noise from input images. However, inaccuracies in inversion can propagate errors, leading to unintended modifications and compromising fidelity. Moreover, even with perfect inversion, the entanglement between textual prompts and image features often results in global changes when only local edits are intended. To address these challenges, we propose a novel text-guided image editing framework based on VAR (Visual AutoRegressive modeling), which eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our method introduces a caching mechanism that stores token indices and probability distributions from the original image, capturing the relationship between the source prompt and the image. Using this cache, we design an adaptive fine-grained masking strategy that dynamically identifies the relevant regions and constrains modifications to them, preventing unintended changes. A token reassembling approach further refines the editing process, enhancing diversity, fidelity, and control. Our framework operates in a training-free manner and achieves high-fidelity editing with fast inference, processing a 1K-resolution image in as little as 1.2 seconds. Extensive experiments demonstrate that our method achieves performance comparable to, or even surpassing, existing diffusion- and rectified-flow-based approaches in both quantitative metrics and visual quality. The code will be released.
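
The abstract names three components (a cache of token indices and probability distributions, an adaptive mask, and token reassembling) without giving their exact form. The toy sketch below is one possible reading, not the authors' implementation. It assumes a single flattened token scale, a hypothetical masking rule (flag a position when the cached token's probability drops by more than a threshold under the target prompt), and hypothetical helper names `edit_mask` and `reassemble`. The actual criterion, the multi-scale handling, and the model calls are not specified in this section.

```python
import random
from typing import List, Sequence


def edit_mask(
    source_probs: Sequence[Sequence[float]],
    target_probs: Sequence[Sequence[float]],
    cached_tokens: Sequence[int],
    threshold: float = 0.5,
) -> List[bool]:
    """Flag positions whose cached token becomes much less likely under the target prompt.

    The drop-in-probability rule and the threshold are assumptions made for illustration.
    """
    mask = []
    for p_src, p_tgt, tok in zip(source_probs, target_probs, cached_tokens):
        drop = p_src[tok] - p_tgt[tok]
        mask.append(drop > threshold)
    return mask


def reassemble(
    cached_tokens: Sequence[int],
    target_probs: Sequence[Sequence[float]],
    mask: Sequence[bool],
    rng: random.Random,
) -> List[int]:
    """Keep cached tokens outside the mask and resample masked positions from the target distribution."""
    edited = []
    for tok, p_tgt, masked in zip(cached_tokens, target_probs, mask):
        if masked:
            # Draw a new token index, weighted by the target-prompt probabilities.
            edited.append(rng.choices(range(len(p_tgt)), weights=p_tgt, k=1)[0])
        else:
            # Non-edited region: reuse the token cached from the original image.
            edited.append(tok)
    return edited
```

In this reading, positions outside the mask are copied from the cache unchanged, which is what would keep non-edited regions intact. Only the masked positions depend on the target prompt.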
