RegionE: Adaptive Region-Aware Generation for Efficient Image Editing

Pengtao Chen, Xianfang Zeng, Maosen Zhao, Mingzhu Shen, Peng Ye, Bangyin Xiang, Zhibo Wang, Wei Cheng, Gang Yu, Tao Chen

TL;DR

RegionE addresses the inefficiency of diffusion-based, instruction-based image editing (IIE) by exploiting the different denoising trajectories of edited and unedited regions. It introduces Adaptive Region Partition to separate edited from unedited regions, Region-Aware Generation with a Region-Instruction KV Cache and an Adaptive Velocity Decay Cache to accelerate region-wise denoising, and stabilization and smoothing stages to maintain quality. Across three open-source IIE base models, it reports end-to-end speedups of roughly 2×–2.6× with negligible degradation in PSNR/SSIM and GPT-4o-based perceptual metrics. The approach is training-free, so it applies to existing models without fine-tuning while preserving semantic and perceptual integrity.

Abstract

Recently, instruction-based image editing (IIE) has received widespread attention. In practice, IIE often modifies only specific regions of an image, while the remaining areas stay largely unchanged. Although these two types of regions differ significantly in generation difficulty and computational redundancy, existing IIE models do not account for this distinction and instead apply a uniform generation process across the entire image. This motivates us to propose RegionE, an adaptive, region-aware generation framework that accelerates IIE tasks without additional training. Specifically, the RegionE framework consists of three main components: 1) Adaptive Region Partition. We observe that the trajectory of unedited regions is straight, allowing multi-step denoised predictions to be inferred in a single step. Therefore, in the early denoising stages, we partition the image into edited and unedited regions based on the difference between the final estimated result and the reference image. 2) Region-Aware Generation. After distinguishing the regions, we replace multi-step denoising with one-step prediction for unedited areas. For edited regions, the trajectory is curved, requiring local iterative denoising. To improve the efficiency and quality of local iterative generation, we propose the Region-Instruction KV Cache, which reduces computational cost while incorporating global information. 3) Adaptive Velocity Decay Cache. Observing that adjacent timesteps in edited regions exhibit strong velocity similarity, we further propose an adaptive velocity decay cache to accelerate the local denoising process. We applied RegionE to state-of-the-art IIE base models, including Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit, where it achieved acceleration factors of 2.57, 2.41, and 2.06, respectively. Evaluations by GPT-4o confirmed that semantic and perceptual fidelity were well preserved.
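The first two components can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a common rectified-flow convention that the abstract does not state, namely $x_t = (1 - t)\,x_0 + t\,\epsilon$ with velocity $v = \epsilon - x_0$, so a straight trajectory gives the one-step estimate $\hat{x}_0 = x_t - t\,v_t$. The function names, the per-token mean absolute difference, and the threshold value are hypothetical choices made for illustration.

```python
import numpy as np

def one_step_estimate(x_t, v_t, t):
    """Extrapolate to t = 0 along a straight line: x0_hat = x_t - t * v_t.

    Assumes x_t = (1 - t) * x_0 + t * noise and v = noise - x_0 (an assumption).
    """
    return x_t - t * v_t

def partition_regions(x0_hat, reference, threshold):
    """Return a per-token boolean mask: True = edited, False = unedited.

    A token counts as edited when its estimated final value differs from the
    reference image by more than `threshold` (hypothetical criterion).
    """
    diff = np.abs(x0_hat - reference).mean(axis=-1)
    return diff > threshold

# Toy example: 6 tokens with 4 channels each; only the first two are edited.
rng = np.random.default_rng(0)
reference = rng.normal(size=(6, 4))
target = reference.copy()
target[:2] += 3.0

noise = rng.normal(size=(6, 4))
t = 0.8
x_t = (1 - t) * target + t * noise
v_t = noise - target  # exact velocity of a perfectly straight trajectory

x0_hat = one_step_estimate(x_t, v_t, t)
mask = partition_regions(x0_hat, reference, threshold=0.5)
print(mask)
```

In this toy setting every trajectory is exactly straight, so the one-step estimate recovers the target and the mask isolates the two edited tokens. In the setting the abstract describes, only unedited regions are close to straight, which is why edited regions still need iterative denoising.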

Paper Structure

This paper contains 13 sections, 13 equations, 10 figures, 18 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Trajectories of different regions in the IIE task. In unedited regions, the trajectory is nearly linear, allowing early-stage velocity to provide a reliable estimate of the multi-step denoised images, including the final result. In contrast, edited regions exhibit curved trajectories, making the final image harder to predict. Despite this, the velocity between consecutive timesteps remains consistent.
  • Figure 2: Comparison between traditional DiT and DiT in IIE (a, b). Symbolic visualization of the denoising process (c). L1 and cosine similarities of velocities between adjacent timesteps during denoising (d, e). Cosine similarity between $\bm v_{21}$ and the velocities after $t_{21}$ in edited and unedited regions (f). Cross-step key similarity (g) and cross-step similarity of instruction-related keys (h).
  • Figure 3: Overview of RegionE. RegionE consists of three stages: STS, RAGS, and SMS. In the STS, no acceleration is applied because the DiT outputs are unstable, and all KV values are cached at the final step. In the RAGS, an Adaptive Region Partition distinguishes between edited and unedited regions: unedited regions are denoised in one step, while edited regions are generated iteratively. This iterative generation uses RIKVCache to inject global information and AVDCache for acceleration. Certain forced-update steps aggregate the full image and refresh RIKVCache with a complete DiT computation. Finally, in the SMS, several full denoising steps are performed to eliminate artifacts along the boundaries between edited and unedited regions.
  • Figure 4: Examples of edited images by RegionE and baseline on Step1X-Edit-v1p1.
  • Figure 5: Pipeline Based on Residual Cache.
  • ...and 5 more figures
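
The adjacent-timestep measurements mentioned in the Figure 2 caption (panels d and e) can be sketched as follows. This only illustrates how L1 and cosine similarities between consecutive velocities might be computed; it does not reproduce the AVDCache update rule, which is not given here. The helper names and the toy velocity values are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two flattened arrays."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def adjacent_similarities(velocities):
    """Compare each velocity with the one from the previous timestep.

    Returns a list of (cosine similarity, mean L1 distance) pairs,
    one per pair of adjacent timesteps.
    """
    pairs = []
    for prev, cur in zip(velocities[:-1], velocities[1:]):
        cos = cosine_similarity(prev, cur)
        l1 = float(np.abs(prev - cur).mean())
        pairs.append((cos, l1))
    return pairs

# Toy velocities that drift slowly between timesteps (made-up values).
velocities = [
    np.array([1.0, 0.0]),
    np.array([1.0, 0.1]),
    np.array([1.0, 0.2]),
]
sims = adjacent_similarities(velocities)
for step, (cos, l1) in enumerate(sims, start=1):
    print(f"step {step}: cosine={cos:.4f}, L1={l1:.4f}")
```

A high cosine similarity together with a small L1 distance is the kind of signal the captions describe as consistent velocity between consecutive timesteps.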