Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models

Tao Xia, Jiawei Liu, Yukun Zhang, Ting Liu, Wei Wang, Lei Zhang

Abstract

Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that refines editable regions while balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features, based on which we design a simple yet effective feature injection mechanism to enhance structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches across both local and global editing scenarios.
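
As a rough illustration of the feature injection idea described in the abstract, the sketch below blends source-branch features into the target branch with one learnable ratio per scale and layer. This is a minimal sketch under stated assumptions, not the paper's implementation: the class name, the sigmoid parameterization, and the initialization are hypothetical choices, and the paper additionally optimizes the ratios (e.g., with reinforcement learning) rather than leaving them as free parameters.

```python
import torch

class FeatureInjection(torch.nn.Module):
    """Hypothetical scale- and layer-specific feature injection.

    alpha[k, l] parameterizes the injection ratio for scale k and layer l.
    """

    def __init__(self, num_scales: int, num_layers: int):
        super().__init__()
        # Initialize so that sigmoid(alpha) is close to 0, i.e. almost no injection.
        self.alpha = torch.nn.Parameter(torch.full((num_scales, num_layers), -4.0))

    def forward(self, f_tgt: torch.Tensor, f_src: torch.Tensor,
                scale: int, layer: int) -> torch.Tensor:
        # Convex blend of target- and source-branch features at this scale/layer.
        a = torch.sigmoid(self.alpha[scale, layer])  # keep the ratio in (0, 1)
        return (1.0 - a) * f_tgt + a * f_src
```

Under this parameterization, ratios near zero reduce to ordinary target-branch generation, while larger ratios at structure-related scales and layers pull the edited features toward the source structure.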

Paper Structure

This paper contains 12 sections, 14 equations, 9 figures, and 1 table.

Figures (9)

  • Figure 1: SAVAREdit for Text-Guided Image Editing. Compared with AREdit [wang2025training], our method maintains consistent spatial structure under both local and global edits.
  • Figure 2: Method overview. Given a source image $I_{\text{src}}$ and its corresponding text description $T_{\text{src}}$, our framework generates the edited image $I_{\text{edit}}$ according to the target text $T_{\text{tgt}}$. The tokenizer $E$ encodes the input $I_{\text{src}}$ into multi-scale residuals $\{R_1,\ldots,R_K\}$. For clarity, we illustrate the inference pipeline at the $k$-th scale. The framework adopts a dual-branch architecture, where the source branch takes $\tilde{F}_{k-1}^{\text{src}}$ and the target branch takes $\tilde{F}_{k-1}^{\text{tgt}}$, producing probability distributions $P_k^{\text{src}}$ and $P_k^{\text{tgt}}$. The intermediate feature maps, illustrated in (b), are selectively injected from the source branch into the target branch (\ref{subsec:SAVAR}) with learnable weights (\ref{subsec:AFI}). The predicted distributions $P_{k}^{\text{tgt}}$ are fed into the CFTL module (\ref{subsec:CFTL}) to obtain a refined editing mask $\hat{M}_k$, which is then used by token reassembly (TR) to produce the prediction $R_k^{\text{tgt}}$ at the $k$-th scale. Finally, the multi-scale predictions $\{R_1,\ldots,R_{\gamma},R_{\gamma+1}^{\text{tgt}},\ldots,R_{K}^{\text{tgt}}\}$ are jointly decoded to produce the final edited image $I_{\text{edit}}$. Here, $\gamma$ denotes the number of source scales reused in the target branch. A minimal sketch of this mask-based token reassembly is given after the figure list.
  • Figure 3: Effect of CFG on background preservation and editing fidelity. A higher CFG yields better editing fidelity but weaker background preservation, whereas a lower CFG produces the opposite effect. Our hybrid method achieves a good balance. The number in each grid cell indicates how many tokens need to be replaced at that position.
  • Figure 4: VAR features and attention maps. PCA visualization of intermediate representations and attention maps in a VAR block. Cross-attention (CA) provides coarse object localization similar to that in diffusion models, whereas VAR self-attention (SA) does not exhibit the spatial affinity structure typically observed in diffusion-based methods.
  • Figure 5: Dependency analysis for feature injection. Top: Experimental results of injecting features at different scales and layers, where the numbers denote layer and scale indices. Injecting features at the 0-th layer and at scales 5--8 leads to markedly better structural preservation. Bottom: Feature injection ratios after genetic algorithm optimization.
  • ...and 4 more figures
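
The token reassembly (TR) step described in the Figure 2 caption can be pictured as a per-scale selection between source residual tokens and newly predicted tokens under the refined mask $\hat{M}_k$, with the first $\gamma$ source scales reused unchanged. The sketch below is a hedged illustration with hypothetical names (reassemble, assemble_scales, r_src, r_pred), assuming each mask is a binary map at its scale's resolution; it is not the authors' code.

```python
import torch

def reassemble(r_src: torch.Tensor, r_pred: torch.Tensor,
               mask: torch.Tensor) -> torch.Tensor:
    """Keep source residual tokens outside the editing mask and take the
    target-branch predictions inside it (one call per scale k)."""
    # mask: binary map of shape (h_k, w_k); r_src / r_pred: per-scale token
    # maps or residuals of matching (broadcastable) spatial shape.
    return torch.where(mask.bool(), r_pred, r_src)

def assemble_scales(r_src_list, r_pred_list, mask_list, gamma: int):
    """Reuse the first `gamma` source scales unchanged, then reassemble the
    remaining scales with their refined editing masks.

    All three lists are assumed to hold one entry per scale, 1..K.
    """
    out = list(r_src_list[:gamma])
    for r_src, r_pred, mask in zip(r_src_list[gamma:], r_pred_list[gamma:],
                                   mask_list[gamma:]):
        out.append(reassemble(r_src, r_pred, mask))
    # The resulting multi-scale residuals are decoded jointly by the tokenizer's
    # decoder to produce the edited image.
    return out
```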