MoEdit: On Learning Quantity Perception for Multi-object Image Editing

Yanfeng Li, Kahou Chan, Yue Sun, Chantong Lam, Tong Tong, Zitong Yu, Keren Fu, Xiaohong Liu, Tao Tan

TL;DR

MoEdit tackles the challenge of editing multi-object images while maintaining consistent object quantities by introducing two modules, FeCom and QTTN, that operate without auxiliary tools. FeCom binds quantity- and object-level prompts to CLIP features to reduce attribute interlacing, while QTTN extracts and injects quantity-aware signals into a specific U-Net block to preserve quantity perception during editing. The approach, built on SDXL with a CLIP encoder, achieves state-of-the-art results across multiple objective and subjective metrics and demonstrates strong qualitative performance in attribute preservation and editability. These advances enable more reliable, scalable multi-object editing with improved aesthetic and semantic consistency, though 3D-aware limitations remain a topic for future work.

Abstract

Multi-object images are prevalent in various real-world scenarios, including augmented reality, advertisement design, and medical imaging. Efficient and precise editing of these images is critical for these applications. With the advent of Stable Diffusion (SD), high-quality image generation and editing have entered a new era. However, existing methods often struggle to consider each object both individually and as part of the whole image during editing. Both views are crucial for consistent quantity perception, and neglecting them results in suboptimal perceptual performance. To address these challenges, we propose MoEdit, an auxiliary-free multi-object image editing framework. MoEdit facilitates high-quality multi-object image editing in terms of style transfer, object reinvention, and background regeneration, while ensuring consistent quantity perception between inputs and outputs, even with a large number of objects. To achieve this, we introduce the Feature Compensation (FeCom) module, which keeps the attributes of each object distinct and separable by minimizing the interlacing between them. Additionally, we present the Quantity Attention (QTTN) module, which perceives and preserves quantity consistency by effectively controlling the editing process, without relying on auxiliary tools. By leveraging the SD model, MoEdit enables customized, high-quality preservation and modification of specific concepts in the inputs. Experimental results demonstrate that MoEdit achieves State-Of-The-Art (SOTA) performance in multi-object image editing. Data and code will be available at https://github.com/Tear-kitty/MoEdit.

Paper Structure

This paper contains 30 sections, 4 equations, 20 figures, 5 tables.

Figures (20)

  • Figure 1: Visual comparisons of our MoEdit with TurboEdit wu2024turboedit. Reference represents input images. Five different images edited by each method are based on five distinct text prompts.
  • Figure 2: The framework of our proposed method. The Feature Compensation (FeCom) module uses text prompts with quantity and object information to compensate for the inferior image features extracted by the CLIP image encoder. The compensated image features are then processed by the Quantity Attention (QTTN) module to obtain consistent quantity perception, and the result is injected into the U-Net to control image editing. During training, the text prompt is set to null-text. During inference, the text prompt can be modified to edit images.
  • Figure 3: Illustration of the purpose of the FeCom module. Reference denotes the input image. (a) When only the CLIP-encoded image information is used to represent $I_g$ as input to the QTTN module, the attributes of the foxes are either lost or shifted towards the attributes of the rabbits, resulting in attribute aliasing. (b) The three images illustrate the results of adding three different Gaussian noise samples, drawn from $\mathcal{N}(0,1)$, to $CLIP(I)$. This process alters the object attributes, background, and style while maintaining the structure and clarity of the image.
  • Figure 4: The framework of the QTTN module. The Extraction module perceives the information of each object both individually and as part of the whole image. Subsequently, an attention mechanism converts this information into a format compatible with the U-Net architecture, thereby ensuring that the consistent quantity perception provided by the QTTN module effectively controls the editing process.
  • Figure 5: Qualitative comparisons. Blip-diffusion li2024blip struggles with editability and stability, while IP-Adapter ye2023ip improves clarity but lacks robustness in quantity preservation. MS-diffusion wang2024ms handles few objects well but fails with larger numbers. Emu2 sun2024generative ensures structural consistency but limits editability. TurboEdit wu2024turboedit balances tasks yet falls short in quantity and style handling. Finally, MoEdit successfully achieves a well-balanced performance in quantity consistency, semantic accuracy, stylistic consistency, and aesthetic quality.
  • ...and 15 more figures
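
The perturbation described in the Figure 3(b) caption, adding samples from $\mathcal{N}(0,1)$ to $CLIP(I)$, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: `perturb_embedding` and `clip_features` are hypothetical names, the short list of floats merely stands in for a real CLIP image embedding, and the caption does not say how the noise is applied, so independent element-wise noise is assumed here.

```python
import random


def perturb_embedding(embedding, num_variants=3, seed=0):
    """Return `num_variants` noisy copies of `embedding`.

    Each copy adds independent N(0, 1) noise to every element.
    Element-wise sampling is an assumption; the caption only states
    that Gaussian noise N(0, 1) is added to CLIP(I).
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        [x + rng.gauss(0.0, 1.0) for x in embedding]
        for _ in range(num_variants)
    ]


# Hypothetical stand-in for CLIP(I); a real embedding would come from
# a CLIP image encoder and be much higher-dimensional.
clip_features = [0.1 * i for i in range(8)]

# Three differently perturbed embeddings, mirroring the three images in (b).
variants = perturb_embedding(clip_features, num_variants=3)
```

In the setting the caption describes, each perturbed embedding would condition a separate generation; the sketch stops at producing the perturbed vectors.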