IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout

Fei Shen, Yutong Gao, Jian Yu, Xiaoyu Du, Jinhui Tang

TL;DR

IMAGHarmony tackles multi-object image editing with strict control over object quantity and spatial layout. It introduces a harmony-aware (HA) module and a preference-guided noise selection (PNS) strategy within a diffusion-based framework. The HA module explicitly encodes object counts and implicitly encodes layout through quantity–layout attention and a cross-attention path that injects perception features into a frozen SDXL backbone, enabling structurally faithful edits. PNS stabilizes the diffusion trajectory by selecting semantically aligned initial noise seeds via vision–language matching during both training and inference, which reduces layout drift and count errors. HarmonyBench provides a benchmark for evaluating quantity- and layout-consistent editing (QL-Edit). Across class, scene, and style editing tasks, IMAGHarmony achieves state-of-the-art structural and semantic performance using only 200 training examples and 10.6M trainable parameters, indicating strong generalization and practical efficiency for multi-object editing. The approach is a lightweight, plug-and-play solution that decouples structural control from style or content changes, with potential applications in content creation and visual reasoning. Key contributions include (i) the QL-Edit formulation and the HA module for explicit quantity and implicit layout modeling, (ii) the PNS strategy for stable seed selection guided by vision-language models (VLMs), and (iii) HarmonyBench, a diverse benchmark for multi-object editing evaluation.

Abstract

Recent diffusion models have advanced image editing by improving fidelity and controllability across creative and personalized applications. However, multi-object scenes remain challenging, as reliable control over object categories, counts, and spatial layout is difficult to achieve. To this end, we first study quantity- and layout-consistent image editing, abbreviated as QL-Edit, which targets control of object quantity and spatial layout in multi-object scenes. Then, we present IMAGHarmony, a straightforward framework featuring a plug-and-play harmony-aware (HA) module that fuses perception semantics while modeling object counts and locations, resulting in accurate edits and strong structural consistency. We further observe that diffusion models are sensitive to the choice of initial noise and tend to prefer certain noise patterns. Based on this finding, we present a preference-guided noise selection (PNS) strategy that selects semantically aligned initial noise through vision-language matching, thereby further improving generation stability and layout consistency in multi-object editing. To support evaluation, we develop HarmonyBench, a comprehensive benchmark that covers a diverse range of quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony outperforms prior methods in both structural alignment and semantic accuracy while using only 200 training images and 10.6M trainable parameters. Code, models, and data are available at https://github.com/muzishen/IMAGHarmony.
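
The noise-selection idea in the abstract (score several candidate initial noises for semantic alignment, keep the best ones) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `select_top_k_seeds` and `toy_alignment_score` are hypothetical, and the scorer is a deterministic pseudo-random stand-in for a real vision-language alignment score, which would require generating a preview from each seed and comparing it with the instruction.

```python
import random
from typing import Callable, List, Tuple


def select_top_k_seeds(
    candidate_seeds: List[int],
    score_fn: Callable[[int], float],
    k: int,
) -> List[Tuple[int, float]]:
    """Score every candidate seed and return the k best as (seed, score), best first."""
    if k <= 0:
        raise ValueError("k must be positive")
    scored = [(seed, score_fn(seed)) for seed in candidate_seeds]
    # Highest score first; ties are broken by the smaller seed for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


def toy_alignment_score(seed: int) -> float:
    """Hypothetical stand-in for a VLM alignment score in [0, 1).

    ASSUMPTION: a real scorer would render a preview from the seed and ask a
    vision-language model how well it matches the count/layout instruction.
    Here the score is only a deterministic pseudo-random number per seed.
    """
    return random.Random(seed).random()


# Usage: rank 16 candidate seeds and keep the 3 best-scoring ones.
candidates = list(range(16))
top = select_top_k_seeds(candidates, toy_alignment_score, k=3)
for seed, score in top:
    print(f"seed={seed} score={score:.3f}")
```

Passing the scorer as a callable keeps the selection logic independent of which model produces the alignment score, so the toy scorer can be swapped for a real one without touching the ranking code.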

Paper Structure

This paper contains 16 sections, 7 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Editing results on (a) few-object and (b) multi-object scenes. Existing methods struggle to preserve the count, layout, and semantics in multi-object cases, while ours ensures consistent and faithful edits.
  • Figure 2: Overview of the IMAGHarmony framework. Given a source image and a text instruction, we first sample multiple candidate noise seeds and score their semantic alignment using a vision-language model (VLM). The top-$k$ candidates are selected for inference. The harmony-aware (HA) module fuses auxiliary textual and visual features to jointly model object count and spatial layout. Then, a cross-attention layer injects these perception features into the UNet backbone without modifying its frozen weights. The preference-guided noise selection (PNS) strategy searches for and selects the best candidate seed to stabilize trajectories across training and inference, ultimately leading to consistent edited images.
  • Figure 3: Examples of training data from HarmonyBench. Each sample includes a source image and a validated count-object caption, verified by YOLO-World (Cheng et al., 2024) and human review.
  • Figure 4: Qualitative comparisons with several state-of-the-art models on the HarmonyBench dataset. A red cross indicates an unsuccessful edit, where the output shows no difference compared to the source image.
  • Figure 5: User study results on HarmonyBench. Our method is rated highest by users.
  • ...and 6 more figures
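
The Figure 2 caption mentions a cross-attention layer that injects perception features into the UNet without changing its frozen weights. The sketch below shows one common way such an injection can work, under stated assumptions: the backbone's hidden states act as queries, the perception features act as keys and values, and the attended result is added residually with a scale factor. The residual form, the `scale` parameter, and the names `cross_attention` and `inject_perception` are assumptions for illustration, not details confirmed by the source, and the learned query/key/value projections of a real layer are omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention: each query attends over all key/value tokens."""
    d = len(keys[0])
    out = []
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(logits)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return out


def inject_perception(hidden: Matrix, perception: Matrix, scale: float = 1.0) -> Matrix:
    """Add attended perception features to hidden states (ASSUMED residual form).

    `hidden` stands for activations produced by the frozen backbone; nothing in
    the backbone is modified, only this extra path contributes new information.
    """
    attended = cross_attention(hidden, perception, perception)
    return [
        [h + scale * a for h, a in zip(h_row, a_row)]
        for h_row, a_row in zip(hidden, attended)
    ]


# Usage: two hidden tokens attend over two perception tokens (all of width 2).
hidden_states = [[1.0, 2.0], [3.0, 4.0]]
perception_tokens = [[0.5, -0.5], [2.0, 1.0]]
print(inject_perception(hidden_states, perception_tokens, scale=0.5))
```

With `scale=0.0` the function returns the hidden states unchanged, which is the sense in which such an added path leaves the frozen backbone's behavior intact until the new module contributes.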