PositionIC: Unified Position and Identity Consistency for Image Customization
Junjie Hu, Tianyang Han, Kai Ma, Jialin Gao, Song Yang, Xianhua He, Junfeng Luo, Xiaoming Wei, Wenqiang Zhang
TL;DR
PositionIC tackles fine-grained spatial control in subject-driven image customization by combining a scalable data synthesis pipeline (BMPDS) with a layout-aware diffusion framework that decouples layout from identity through Visibility-Aware Attention (VAA), built on a NeRF-inspired volumetric weight regulation. BMPDS automatically generates high-quality, position-annotated multi-subject data, filtered by multi-modal models and LLM-based descriptors, to produce the PIC-98K dataset used for training. PositionIC demonstrates state-of-the-art spatial precision and identity consistency on benchmarks such as DreamBench and PositionIC-Bench, with ablations confirming the effectiveness of VAA and data filtering. The work enables precise, occlusion-aware multi-subject placement without extra training overhead, advancing practical, controllable image customization for multi-entity scenes; the authors state that code and data will be publicly released.
Abstract
Recent subject-driven image customization excels in fidelity, yet fine-grained instance-level spatial control remains an elusive challenge, hindering real-world applications. This limitation stems from two factors: a scarcity of scalable, position-annotated datasets, and the entanglement of identity and layout by global attention mechanisms. To this end, we introduce PositionIC, a unified framework for high-fidelity, spatially controllable multi-subject customization. First, we present BMPDS, the first automatic data-synthesis pipeline for position-annotated multi-subject datasets, effectively providing crucial spatial supervision. Second, we design a lightweight, layout-aware diffusion framework that integrates a novel visibility-aware attention mechanism. This mechanism explicitly models spatial relationships via a NeRF-inspired volumetric weight regulation to effectively decouple instance-level spatial embeddings from semantic identity features, enabling precise, occlusion-aware placement of multiple subjects. Extensive experiments demonstrate that PositionIC achieves state-of-the-art performance on public benchmarks, setting new records for spatial precision and identity consistency. Our work represents a significant step towards truly controllable, high-fidelity image customization in multi-entity scenarios. Code and data will be publicly released.
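The abstract does not give the formula behind the volumetric weight regulation, so the following is only an illustrative sketch of the general idea it borrows from NeRF: front-to-back compositing, where each subject's per-pixel weight is its opacity multiplied by the transmittance left over from the subjects in front of it. The function names (`box_mask`, `visibility_weights`), the box format `(x0, y0, x1, y1)`, the per-subject `densities`, and the nearest-first ordering are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def box_mask(box, height, width):
    """Binary mask for a box (x0, y0, x1, y1) in pixel coordinates (hypothetical format)."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float64)
    mask[y0:y1, x0:x1] = 1.0
    return mask


def visibility_weights(boxes, densities, height, width, delta=1.0):
    """NeRF-style compositing weights for boxes ordered nearest first.

    For subject i: alpha_i = 1 - exp(-sigma_i * delta) inside its box,
    weight_i = T_i * alpha_i, and T_{i+1} = T_i * (1 - alpha_i).
    Returns (weights of shape [n, H, W], leftover background transmittance).
    """
    transmittance = np.ones((height, width), dtype=np.float64)
    weights = []
    for box, sigma in zip(boxes, densities):
        alpha = (1.0 - np.exp(-sigma * delta)) * box_mask(box, height, width)
        weights.append(transmittance * alpha)
        transmittance = transmittance * (1.0 - alpha)
    return np.stack(weights), transmittance


# Two overlapping subjects on a 4x4 grid; the first box is in front.
boxes = [(0, 0, 2, 2), (1, 1, 3, 3)]
densities = [50.0, 50.0]  # large density: nearly opaque subjects
weights, background = visibility_weights(boxes, densities, 4, 4)
```

In the overlap pixel `(1, 1)` the rear subject's weight is close to zero because the front subject absorbs almost all the transmittance, which is the kind of occlusion-aware behavior the abstract describes. How such weights would modulate attention between layout and identity tokens is left open here, since the abstract does not specify it.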
