NEP: Autoregressive Image Editing via Next Editing Token Prediction

Huimin Wu, Xiaojian Ma, Haozhe Zhao, Yanpeng Zhao, Qing Li

TL;DR

This work tackles localized, text-guided image editing by reframing the task as Next Editing-token Prediction (NEP), which regenerates only the editing regions and so avoids unwanted changes to the rest of the image. It introduces RLlamaGen, an any-order autoregressive text-to-image (T2I) model trained in two stages to support arbitrary-region editing and zero-shot editing, and NEP, which conditions editing on the text instruction, the source image, and an editing-region mask, with iterative refinement at test time. Results on MagicBrush and Emu Edit show state-of-the-art region-based editing and competitive free-form editing, and the pre-trained model already demonstrates robust zero-shot editing. The approach reduces wasted computation, supports test-time scaling, and offers a flexible, controllable editing framework.

Abstract

Text-guided image editing involves modifying a source image based on a language instruction and, typically, requires changes to only small local regions. However, existing approaches generate the entire target image rather than selectively regenerate only the intended editing areas. This results in (1) unnecessary computational costs and (2) a bias toward reconstructing non-editing regions, which compromises the quality of the intended edits. To resolve these limitations, we propose to formulate image editing as Next Editing-token Prediction (NEP) based on autoregressive image generation, where only regions that need to be edited are regenerated, thus avoiding unintended modification to the non-editing areas. To enable any-region editing, we propose to pre-train an any-order autoregressive text-to-image (T2I) model. Once trained, it is capable of zero-shot image editing and can be easily adapted to NEP for image editing, which achieves a new state-of-the-art on widely used image editing benchmarks. Moreover, our model naturally supports test-time scaling (TTS) through iteratively refining its generation in a zero-shot manner. The project page is: https://nep-bigai.github.io/

Paper Structure

This paper contains 22 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Our approach avoids full-image generation and, unlike the previous diffusion-model-based editing approach zhang2024magicbrush, does not introduce unintended changes.
  • Figure 2: Overview of Next-Editing-token Prediction. The input sequence comprises: 1) text embeddings, extracted from FLAN-T5, 2) source image embeddings, tokenized by VQGAN, and 3) mask embeddings, a sequence of interleaved editing and non-editing embeddings. The output editing tokens (in raster scan order) are filled back into the source image according to the editing mask. $PE_{i}$ denotes the learned positional embeddings that specify the token generation order.
  • Figure 3: Visualized ablation on Editing Region Conditioning (ERC). Removing ERC increases the chance that the editing model refuses to modify the source image. Best viewed zoomed in and in color.
  • Figure 4: Comparative editing results. Our approach makes more faithful edits to source images, either by updating objects (cases $\#1$, $\#2$) or by making fine-grained edits (case $\#3$). Best viewed zoomed in and in color.
  • Figure 5: Examples of RLlamaGen's zero-shot editing capability. It can make fine-grained edits such as adding external objects (ice cream in example #1), changing the state of input objects (cabinet door open $\rightarrow{}$ closed in example #2), changing the semantics (chips $\rightarrow{}$ fries in example #3), and changing the color (white $\rightarrow{}$ red in example #4). Best viewed zoomed in and in color.
  • ...and 1 more figures
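
The fill-back step described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fill_back`, the flat-list token representation, and the toy token ids are all hypothetical. It only shows the idea that predicted editing tokens, produced in raster-scan order, replace the masked positions while every other source token is kept unchanged.

```python
from typing import List


def fill_back(
    source_tokens: List[int],
    edit_mask: List[bool],
    edited_tokens: List[int],
) -> List[int]:
    """Write predicted editing tokens into the masked positions.

    source_tokens: token ids of the source image, flattened in raster-scan order
                   (hypothetical representation of the tokenizer output).
    edit_mask:     True where a position belongs to the editing region.
    edited_tokens: one predicted token per masked position, in raster-scan order.
    """
    if len(source_tokens) != len(edit_mask):
        raise ValueError("mask and source must have the same length")
    if sum(edit_mask) != len(edited_tokens):
        raise ValueError("need exactly one predicted token per masked position")

    predicted = iter(edited_tokens)
    # Masked positions take the next predicted token; all others keep the source token.
    return [next(predicted) if masked else tok
            for tok, masked in zip(source_tokens, edit_mask)]


# Toy 2x3 token grid, flattened row by row (values are made up).
source = [10, 11, 12, 13, 14, 15]
mask = [False, True, True, False, False, True]
target = fill_back(source, mask, [91, 92, 93])
print(target)  # [10, 91, 92, 13, 14, 93]
```

In this toy case only 3 of the 6 positions are regenerated, which is the source of the computational saving the abstract refers to: the number of predicted tokens scales with the size of the editing region, not with the full image.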