NEP: Autoregressive Image Editing via Next Editing Token Prediction
Huimin Wu, Xiaojian Ma, Haozhe Zhao, Yanpeng Zhao, Qing Li
TL;DR
This work tackles localized, text-guided image editing by reframing the task as Next Editing-token Prediction (NEP), which regenerates only the editing regions and so avoids unwanted changes to the rest of the image. It introduces RLlamaGen, an any-order autoregressive T2I model trained in a two-stage regime to enable arbitrary-region editing and zero-shot capabilities, and NEP, which conditions editing on the text instruction, the source image, and a mask over the editing region, and supports iterative refinement at test time. Empirical results on MagicBrush and Emu Edit show state-of-the-art region-based editing and competitive free-form editing, and the pre-trained model also demonstrates robust zero-shot editing. The approach reduces wasted computation, supports test-time scaling, and offers a flexible, controllable editing framework with practical implications for efficient image editing at scale.
Abstract
Text-guided image editing modifies a source image according to a language instruction and typically requires changes to only small local regions. However, existing approaches generate the entire target image rather than selectively regenerating only the intended editing areas. This results in (1) unnecessary computational costs and (2) a bias toward reconstructing non-editing regions, which compromises the quality of the intended edits. To resolve these limitations, we propose to formulate image editing as Next Editing-token Prediction (NEP) based on autoregressive image generation, where only the regions that need to be edited are regenerated, thus avoiding unintended modifications to the non-editing areas. To enable any-region editing, we propose to pre-train an any-order autoregressive text-to-image (T2I) model. Once trained, the model is capable of zero-shot image editing and can be easily adapted to NEP for image editing, where it achieves a new state of the art on widely used image editing benchmarks. Moreover, our model naturally supports test-time scaling (TTS) by iteratively refining its generation in a zero-shot manner. The project page is: https://nep-bigai.github.io/
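The core idea of regenerating only the editing-region tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the names `edit_tokens`, `predict_token`, and `toy_predictor` are hypothetical, the image is assumed to be already tokenized into a flat list of integer ids, and the predictor is a placeholder standing in for a trained any-order autoregressive model conditioned on the text instruction.

```python
from typing import Callable, List, Sequence


def edit_tokens(
    source_tokens: Sequence[int],
    edit_mask: Sequence[bool],
    predict_token: Callable[[List[int], int], int],
) -> List[int]:
    """Regenerate only the masked positions; copy every other token as-is."""
    if len(source_tokens) != len(edit_mask):
        raise ValueError("source_tokens and edit_mask must have the same length")

    tokens = list(source_tokens)
    # Positions to regenerate. An any-order model could visit these in any
    # order; raster order is used here purely for simplicity (assumption).
    edit_positions = [i for i, masked in enumerate(edit_mask) if masked]

    for pos in edit_positions:
        # The predictor sees the full current sequence (unedited context plus
        # tokens already regenerated) and the position it must fill.
        tokens[pos] = predict_token(tokens, pos)
    return tokens


def toy_predictor(context: List[int], position: int) -> int:
    # Hypothetical stand-in for the model: always returns placeholder id 999.
    return 999


source = [11, 12, 13, 14, 15, 16]                 # toy source-image token ids
mask = [False, False, True, True, False, False]   # True marks the editing region
edited = edit_tokens(source, mask, toy_predictor)
num_predicted = sum(mask)                         # tokens actually regenerated
```

In this toy run only the two masked positions are predicted, while the four remaining tokens are carried over unchanged, which mirrors the claimed savings in computation and the preservation of non-editing regions.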
