PositionIC: Unified Position and Identity Consistency for Image Customization

Junjie Hu, Tianyang Han, Kai Ma, Jialin Gao, Song Yang, Xianhua He, Junfeng Luo, Xiaoming Wei, Wenqiang Zhang

TL;DR

PositionIC targets fine-grained spatial control in subject-driven image customization by pairing a scalable data-synthesis pipeline (BMPDS) with a layout-aware diffusion framework. The framework decouples layout from identity through Visibility-Aware Attention (VAA) and a NeRF-inspired Volumetric Weight Regulation (VWR). BMPDS automatically generates high-quality, position-annotated multi-subject data, filtered by multi-modal models and LLM-based descriptors to produce the PIC-98K dataset used for training. PositionIC reports state-of-the-art spatial precision and identity consistency on benchmarks such as DreamBench and PositionIC-Bench, with ablations supporting the effectiveness of VAA and data filtering. The work enables precise, occlusion-aware multi-subject placement without extra training overhead, and the authors state that code and data will be publicly released.

Abstract

Recent subject-driven image customization excels in fidelity, yet fine-grained instance-level spatial control remains an elusive challenge, hindering real-world applications. This limitation stems from two factors: a scarcity of scalable, position-annotated datasets, and the entanglement of identity and layout by global attention mechanisms. To this end, we introduce PositionIC, a unified framework for high-fidelity, spatially controllable multi-subject customization. First, we present BMPDS, the first automatic data-synthesis pipeline for position-annotated multi-subject datasets, effectively providing crucial spatial supervision. Second, we design a lightweight, layout-aware diffusion framework that integrates a novel visibility-aware attention mechanism. This mechanism explicitly models spatial relationships via a NeRF-inspired volumetric weight regulation to effectively decouple instance-level spatial embeddings from semantic identity features, enabling precise, occlusion-aware placement of multiple subjects. Extensive experiments demonstrate that PositionIC achieves state-of-the-art performance on public benchmarks, setting new records for spatial precision and identity consistency. Our work represents a significant step towards truly controllable, high-fidelity image customization in multi-entity scenarios. Code and data will be publicly released.
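
For context, the "NeRF-inspired" weighting mentioned above presumably builds on the standard volume-rendering weights used in NeRF, reproduced here. This is the generic NeRF formulation, not the paper's own volumetric weight regulation equation, which does not appear in this excerpt. The symbols are assumptions for illustration: $\sigma_i$ is the density of the $i$-th sample along a ray (ordered front to back), $\delta_i$ is the spacing to the next sample, $T_i$ is the accumulated transmittance, and $w_i$ is the resulting contribution weight.

```latex
T_i = \exp\!\Big(-\sum_{j<i} \sigma_j \delta_j\Big),
\qquad
w_i = T_i \,\big(1 - \exp(-\sigma_i \delta_i)\big)
```

Under this formulation, a sample that lies behind dense material receives a small $T_i$ and therefore a small weight, which is the kind of behaviour that occlusion-aware placement of overlapping subjects would need.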

Paper Structure

This paper contains 37 sections, 8 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Results from PositionIC across various controllable image customization tasks.
  • Figure 2: Bidirectional Multi-dimensional Perception Data Synthesis framework. (a) We use Subject200K to train a weak model. (b) Forward generation of multi-subject data pairs. (c) Reverse generation of multi-subject data pairs. (d) We utilize MLLMs to filter out our data pairs.
  • Figure 3: The overall framework of PositionIC. (a) Reference images and prompts are encoded and concatenated with the latent embeddings $z_{t}$, then the whole token sequence is passed to the DiT. Each reference image $z_{ref}^{i}$ is visible only to its specific area of the latent noise $z_{t}$ in the attention map. (b) The objects' binary masks and the semantic density $\sigma$ are used by VWR to calculate the weighted mask. The dashed boxes in the ray diagram represent the overlapping regions, where the same colors correspond to the weights in the mask.
  • Figure 4: Qualitative comparison of single-subject generation with different methods on DreamBench.
  • Figure 5: Qualitative comparison of multi-subject generation with different methods on DreamBench. We adopt a fixed bounding box (e.g., bottom left and bottom right) for generation.
  • ...and 12 more figures
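
The region-restricted visibility described in the Figure 3(a) caption can be sketched as a boolean attention mask. This is a minimal illustration, not the authors' implementation: the function name `build_visibility_mask`, the token ordering (text, then latent grid, then one block of tokens per reference image), the box format, and the choice to block attention in both directions are all assumptions made for the example.

```python
def build_visibility_mask(num_text, grid_h, grid_w, boxes, ref_len):
    """Return allowed[q][k], where True means query q may attend to key k.

    Assumed token order: [text | latent grid (row-major) | ref_1 | ... | ref_n].
    boxes[i] = (x0, y0, x1, y1) in latent-grid cells, half-open, for reference i.
    """
    num_latent = grid_h * grid_w
    total = num_text + num_latent + ref_len * len(boxes)
    allowed = [[True] * total for _ in range(total)]
    ref_start = num_text + num_latent

    for i, (x0, y0, x1, y1) in enumerate(boxes):
        ref_ids = range(ref_start + i * ref_len, ref_start + (i + 1) * ref_len)
        for y in range(grid_h):
            for x in range(grid_w):
                if x0 <= x < x1 and y0 <= y < y1:
                    continue  # latent cell lies inside this subject's box
                q = num_text + y * grid_w + x
                for k in ref_ids:
                    allowed[q][k] = False  # outside cell cannot read reference i
                    allowed[k][q] = False  # assumption: block the reverse direction too
    return allowed


# Toy layout: 1 text token, a 2x2 latent grid, two references of 1 token each.
# Reference 1 owns the left column, reference 2 owns the right column.
mask = build_visibility_mask(
    num_text=1, grid_h=2, grid_w=2,
    boxes=[(0, 0, 1, 2), (1, 0, 2, 2)], ref_len=1,
)
```

In this toy layout the token indices are 0 for text, 1 to 4 for the latent cells, and 5 and 6 for the two references. A mask of this shape could then be handed to an attention implementation that accepts boolean masks, with True marking pairs that take part in attention.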