MS-CustomNet: Controllable Multi-Subject Customization with Hierarchical Relational Semantics

Pengxiang Cai, Mengyang Li

Abstract

Diffusion-based text-to-image generation has advanced significantly, yet customizing scenes with multiple distinct subjects while maintaining fine-grained control over their interactions remains challenging. Existing methods often struggle to give users explicit control over the compositional structure of a scene and the precise spatial relationships between subjects. To address this, we introduce MS-CustomNet, a novel framework for multi-subject customization. MS-CustomNet allows zero-shot integration of multiple user-provided objects and, crucially, lets users explicitly define the hierarchical arrangement and spatial placement of those objects within the generated image. Our approach preserves the identity of each individual subject while learning and enacting the user-specified inter-subject composition. We also present the MSI dataset, derived from COCO, to facilitate training on such complex multi-subject compositions. In multi-subject customization tasks, our method achieves a DINO-I score of 0.61 for identity preservation and a YOLO-L score of 0.94 for positional control, indicating that it can generate high-fidelity images with precise, user-directed multi-subject composition and spatial control.
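As a rough illustration of what an identity-preservation score such as DINO-I measures, the sketch below averages the pairwise cosine similarity between embeddings of generated images and embeddings of reference images. This is the common definition of DINO-I, not necessarily the exact evaluation protocol used in this paper, which the abstract does not specify. The function names are hypothetical, and the embeddings are assumed to have been produced beforehand by a DINO image encoder.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def dino_i_score(generated: Sequence[Vector], reference: Sequence[Vector]) -> float:
    """Mean cosine similarity over all (generated, reference) embedding pairs.

    Assumption: both inputs hold image embeddings from the same encoder
    (e.g. a DINO ViT); computing them is outside the scope of this sketch.
    """
    sims = [cosine(g, r) for g in generated for r in reference]
    if not sims:
        raise ValueError("need at least one generated and one reference embedding")
    return sum(sims) / len(sims)
```

A score near 1 means the generated subjects sit close to the references in feature space; lower values indicate identity drift.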

Paper Structure (20 sections, 16 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Construction pipeline of the MSI dataset. Key stages include COCO image filtering based on subject count and size, establishment of a multi-view reference subject pool, generation of composite layout maps using COCO annotations, and assembly of final training tuples.
  • Figure 2: Architecture of MS-CustomNet. The model processes reference subject images ($\{x_{ref,k}\}^K_{k=1}$), textual prompts ($\mathcal{T}$), and layout maps ($M_L$). Category-aware subject features ($f_{s,k}$) are generated and, along with text and location cues, condition a latent diffusion model to synthesize the customized image.
  • Figure 3: Qualitative assessment of single-subject customization. The figure showcases MS-CustomNet's proficiency in preserving subject identity and integrating subjects cohesively into diverse scenes, guided by varied textual prompts.
  • Figure 4: Qualitative assessment of multi-subject customization. These examples illustrate MS-CustomNet's capacity to generate intricate scenes featuring multiple customized subjects, effectively preserving individual identities while adhering to user-defined compositional and spatial relationships.
  • Figure 5: Bar chart comparing various metrics under different ablation configurations. The horizontal axis shows the four configurations in order: baseline, introduction of $M_L$, introduction of DST, and introduction of CLSQ.
  • ...and 1 more figure
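Two stages named in the Figure 1 caption, filtering COCO images by subject count and size and building a layout map from COCO annotations, can be sketched as follows. This is a minimal illustration, not the authors' pipeline. The thresholds (`min_subjects`, `max_subjects`, `min_area_frac`), the function names, and the integer-grid encoding of the layout map are all assumptions. The annotation fields (`image_id`, `area`, `bbox` as `[x, y, width, height]`, `iscrowd`) follow the standard COCO format.

```python
from collections import defaultdict


def select_images(images, annotations, min_subjects=2, max_subjects=4,
                  min_area_frac=0.05):
    """Keep images whose number of sufficiently large subjects is in range.

    All thresholds are hypothetical; the paper's actual values are not given here.
    """
    pixels = {img["id"]: img["width"] * img["height"] for img in images}
    large = defaultdict(list)
    for ann in annotations:
        if ann.get("iscrowd", 0):
            continue  # skip crowd regions, which do not mark a single subject
        if ann["area"] / pixels[ann["image_id"]] >= min_area_frac:
            large[ann["image_id"]].append(ann)
    return {image_id: anns for image_id, anns in large.items()
            if min_subjects <= len(anns) <= max_subjects}


def layout_map(width, height, anns):
    """Rasterize bounding boxes into a grid: 0 is background, k is the k-th subject.

    Where boxes overlap, the later subject overwrites the earlier one.
    """
    grid = [[0] * width for _ in range(height)]
    for k, ann in enumerate(anns, start=1):
        x, y, w, h = ann["bbox"]
        for row in range(max(0, int(y)), min(height, int(y + h))):
            for col in range(max(0, int(x)), min(width, int(x + w))):
                grid[row][col] = k
    return grid
```

In this sketch the overwrite order in `layout_map` acts as a simple stand-in for a front-to-back ordering of subjects; how the paper actually encodes hierarchical relations in $M_L$ is not stated in this section.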