Geometric Image Editing via Effects-Sensitive In-Context Inpainting with Diffusion Transformers

Shuo Zhang, Wenzhuo Wu, Huayu Zhang, Jiarong Cheng, Xianghao Zang, Chao Ban, Hao Sun, Zhongjiang He, Tianwei Cao, Kongming Liang, Zhanyu Ma

TL;DR

GeoEdit advances geometric image editing by integrating a Geometric Transformation module and Effects-Sensitive Attention (ESA) into a diffusion-transformer in-context inpainting framework. It enables precise object manipulation (translation, rotation, scaling) with photorealistic lighting and shadow effects, and its training is supported by RS-Objects, a dataset of over 120,000 image pairs. Empirical results on GeoBench show consistent gains over prior methods in geometric accuracy, realism, and user-perceived quality, with theoretical support for ESA improving attention alignment. The approach offers a scalable, non-finetuning solution for complex scene editing with strong generalization to 2D and 3D transformations.

Abstract

Recent advances in diffusion models have significantly improved image editing. However, challenges persist in handling geometric transformations, such as translation, rotation, and scaling, particularly in complex scenes. Existing approaches suffer from two main limitations: (1) difficulty in achieving accurate geometric editing of object translation, rotation, and scaling; (2) inadequate modeling of intricate lighting and shadow effects, leading to unrealistic results. To address these issues, we propose GeoEdit, a framework that leverages in-context generation through a diffusion transformer module, which integrates geometric transformations for precise object edits. Moreover, we introduce Effects-Sensitive Attention, which enhances the modeling of intricate lighting and shadow effects for improved realism. To further support training, we construct RS-Objects, a large-scale geometric editing dataset containing over 120,000 high-quality image pairs, enabling the model to learn precise geometric editing while generating realistic lighting and shadows. Extensive experiments on public benchmarks demonstrate that GeoEdit consistently outperforms state-of-the-art methods in terms of visual quality, geometric accuracy, and realism.
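
The three edit types named in the abstract (translation, rotation, scaling) can be composed into one 2D affine transform. The sketch below is only an illustration of that standard construction, not GeoEdit's Geometric Transformation module: the function names, the bounding box, and the parameter values are all hypothetical, and it assumes the object is scaled and rotated about a chosen center before being translated.

```python
import math

def affine_matrix(tx=0.0, ty=0.0, angle_deg=0.0, scale=1.0, center=(0.0, 0.0)):
    """Build a 3x3 homogeneous matrix for p' = s * R(angle) * (p - c) + c + t.

    Hypothetical helper: scale and rotate about `center`, then translate by (tx, ty).
    """
    cx, cy = center
    theta = math.radians(angle_deg)
    a = scale * math.cos(theta)
    b = scale * math.sin(theta)
    return [
        [a, -b, cx - a * cx + b * cy + tx],
        [b, a, cy - b * cx - a * cy + ty],
        [0.0, 0.0, 1.0],
    ]

def apply_affine(matrix, point):
    """Apply the homogeneous matrix to a single (x, y) point."""
    x, y = point
    return (
        matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
        matrix[1][0] * x + matrix[1][1] * y + matrix[1][2],
    )

# Corners of a made-up object bounding box, in pixel coordinates.
box = [(10.0, 10.0), (30.0, 10.0), (30.0, 20.0), (10.0, 20.0)]
center = (20.0, 15.0)

# Combined edit: scale by 2, rotate by 90 degrees about the box center, shift by (5, -2).
M = affine_matrix(tx=5.0, ty=-2.0, angle_deg=90.0, scale=2.0, center=center)
moved = [apply_affine(M, p) for p in box]
moved_center = apply_affine(M, center)
```

Because the rotation and scaling are taken about the box center, the center itself is affected only by the translation. In image coordinates, where the y axis points down, a positive angle in this convention appears as a clockwise rotation on screen.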

Paper Structure (29 sections, 1 theorem, 26 equations, 11 figures, 8 tables)

Key Result

Theorem 3.1

Let $A^\star$ be an ideal attention map, and let $\rho$ be its threshold for discriminating between critical and non-critical regions. The necessary conditions on $A^\star$ are defined in Appendix def_ideal_A. If $\rho \geq 1/|\mathcal{T}^{(Q)}_{\mathrm{edit}}|$, then the following statement holds: ... where $D_{\mathrm{KL}}$ denotes KL divergence; $A^\star_{\cdot j}$ denotes the attention distribution ...
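
The theorem statement is cut off in this excerpt, so its inequality cannot be reproduced here. As a reminder of the quantity it refers to, the sketch below computes the KL divergence $D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log(p_i / q_i)$ between two discrete distributions. The example distributions are made up for illustration and are not taken from the paper; the `eps` guard is likewise an assumption added to avoid division by zero.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i is clamped to `eps` (an
    assumption of this sketch) so the ratio stays finite.
    """
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

# Hypothetical attention distributions over four tokens.
ideal = [0.5, 0.5, 0.0, 0.0]      # all mass on two "critical" tokens
focused = [0.4, 0.4, 0.1, 0.1]    # mostly on the critical tokens
uniform = [0.25, 0.25, 0.25, 0.25]

kl_focused = kl_divergence(ideal, focused)
kl_uniform = kl_divergence(ideal, uniform)
```

In this toy setting the distribution that concentrates more mass on the critical tokens has the smaller divergence from the ideal one, which is the sense in which a lower KL value indicates closer attention alignment.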

Figures (11)

  • Figure 1: Our method accurately performs geometric edits including translation, rotation, scaling, and their combinations (e.g., translation combined with rotation and scaling), while achieving reliable generation of lighting and shadow effects to ensure realistic editing results.
  • Figure 2: The framework of the proposed GeoEdit, built upon an in-context inpainting paradigm, consists of a Diffusion Transformer Module that integrates two key components: (1) Geometric Transformation for object editing (translation, rotation, and scaling), and (2) Effects-Sensitive Attention for modeling intricate lighting and shadow effects.
  • Figure 3: Comparison of attention strategies: standard; Hard Modulation; Ours.
  • Figure 4: The rendering-synthesis pipeline for building our RS-Objects dataset.
  • Figure 5: Qualitative comparison with different editing approaches on the 2D-edits of GeoBench.
  • ...and 6 more figures

Theorems & Definitions (1)

  • Theorem 3.1