IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout
Fei Shen, Yutong Gao, Jian Yu, Xiaoyu Du, Jinhui Tang
TL;DR
IMAGHarmony addresses multi-object image editing under strict control of object quantity and spatial layout by adding a harmony-aware (HA) module and a preference-guided noise selection (PNS) strategy to a diffusion-based framework. The HA module encodes object counts explicitly and layout implicitly, using a quantity–layout attention together with a cross-attention path that injects perception features into a frozen SDXL backbone, which enables structurally faithful edits. PNS stabilizes the diffusion trajectory by using vision–language matching to select semantically aligned initial noise seeds during both training and inference, which reduces layout drift and count errors. HarmonyBench provides a benchmark for evaluating QL-Edit (quantity and layout consistent image editing). Across class, scene, and style editing tasks, IMAGHarmony achieves state-of-the-art structural and semantic performance with only 200 training examples and 10.6M trainable parameters, indicating strong generalization and practical efficiency for multi-object editing. The approach is a lightweight, plug-and-play solution that decouples structural control from style or content changes, with implications for reliable, scalable editing of complex scenes and for downstream applications in content creation and visual reasoning. Key contributions are (i) the QL-Edit formulation and the HA module for explicit quantity and implicit layout modeling, (ii) the PNS strategy for stable seed selection guided by VLMs, and (iii) HarmonyBench, a diverse benchmark for evaluating multi-object editing.
Abstract
Recent diffusion models have advanced image editing by improving fidelity and controllability across creative and personalized applications. However, multi-object scenes remain challenging, as reliable control over object categories, counts, and spatial layout is difficult to achieve. To this end, we first study quantity and layout consistent image editing, abbreviated as QL-Edit, which targets control of object quantity and spatial layout in multi-object scenes. We then present IMAGHarmony, a straightforward framework featuring a plug-and-play harmony-aware (HA) module that fuses perception semantics while modeling object counts and locations, resulting in accurate edits and strong structural consistency. We further observe that diffusion models are sensitive to the choice of initial noise and tend to prefer certain noise patterns. Based on this finding, we present a preference-guided noise selection (PNS) strategy that selects semantically aligned initial noise through vision-language matching, further improving generation stability and layout consistency in multi-object editing. To support evaluation, we develop HarmonyBench, a comprehensive benchmark that covers a diverse range of quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony outperforms prior methods in both structural alignment and semantic accuracy, using only 200 training images and 10.6M trainable parameters. Code, models, and data are available at https://github.com/muzishen/IMAGHarmony.
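The idea behind PNS, choosing an initial noise seed whose output best matches the target semantics, can be illustrated with a generic best-of-N selection loop. The sketch below is a minimal illustration under stated assumptions, not the released implementation: the function `select_preferred_seed` and the toy `toy_generate` and `toy_score` stand-ins are hypothetical names introduced here, and in practice the generator would be the diffusion model and the scorer a vision-language matching model.

```python
import random
from typing import Callable, Iterable, Tuple


def select_preferred_seed(
    seeds: Iterable[int],
    generate: Callable[[int], object],
    score: Callable[[object], float],
) -> Tuple[int, float]:
    """Return (seed, score) for the candidate seed whose sample scores highest.

    `generate` and `score` are placeholders (assumptions): `generate` maps a
    seed to a sample, and `score` rates how well that sample matches the
    target description. Ties keep the earliest seed.
    """
    best_seed, best_score = None, float("-inf")
    for seed in seeds:
        sample = generate(seed)
        current = score(sample)
        if current > best_score:
            best_seed, best_score = seed, current
    if best_seed is None:
        raise ValueError("seeds must not be empty")
    return best_seed, best_score


# Toy stand-ins so the sketch runs without any model (hypothetical):
# the "sample" is just a pretend object count drawn from a seeded RNG.
def toy_generate(seed: int) -> int:
    rng = random.Random(seed)
    return rng.randint(1, 6)


def toy_score(count: int, target: int = 3) -> float:
    # Higher is better; 0.0 means the pretend count equals the target.
    return -abs(count - target)


seeds = list(range(8))
best_seed, best_score = select_preferred_seed(seeds, toy_generate, toy_score)
print(f"selected seed={best_seed}, score={best_score}")
```

Swapping the toy functions for a real sampler and a real image-text similarity score keeps the selection loop unchanged; how candidates are scored and at which stage selection happens are details the abstract does not specify.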
