MultiCrafter: High-Fidelity Multi-Subject Generation via Disentangled Attention and Identity-Aware Preference Alignment
Tao Wu, Yibo Jiang, Yehao Lu, Zhizhong Wang, Zeyi Huang, Zequn Qin, Xi Li
TL;DR
MultiCrafter tackles high-fidelity multi-subject image generation aligned with human preferences by decoupling learning into fidelity-focused pre-training and preference-driven post-training. The pre-training stage introduces Identity-Disentangled Attention Regularization (IDAR) with MoE-LoRA to suppress attention leakage and distinguish subject regions. The post-training stage introduces Identity-Preserving Preference Optimization (IPPO), which uses a stable Group Sequence Policy Optimization (GSPO) objective and a Hungarian-based Multi-ID Alignment Reward to optimize aesthetics, text alignment, and identity fidelity. Backed by a large, carefully constructed dataset and an online RL setup, the two-stage framework achieves state-of-the-art subject fidelity while maintaining strong alignment with human preferences across both multi-human and multi-object generation. The approach offers a practical pathway to reliable, high-quality personalized generation with scalable attention control and robust evaluation of multi-subject fidelity.
Abstract
Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. Existing In-Context-Learning-based methods are limited by a highly coupled training paradigm: they attempt to achieve both high subject fidelity and multi-dimensional human preference alignment within a single training stage, relying on a single, indirect reconstruction loss that struggles to satisfy both goals simultaneously. To address this, we propose MultiCrafter, a framework that decouples the task into two distinct training stages. First, in a pre-training stage, we introduce an explicit positional supervision mechanism that effectively resolves attention bleeding and drastically enhances subject fidelity. Second, in a post-training stage, we propose Identity-Preserving Preference Optimization, a novel online reinforcement learning framework. It features a scoring mechanism based on the Hungarian matching algorithm that accurately assesses multi-subject fidelity, allowing the model to optimize for aesthetics and prompt alignment while preserving the subject fidelity achieved in the first stage. Experiments validate that our decoupled framework significantly improves subject fidelity while better aligning with human preferences.
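
To illustrate how a matching-based score can assess multi-subject fidelity, the following is a minimal sketch and not the paper's implementation. It assumes each reference subject and each subject detected in the generated image has already been mapped to an embedding vector by some identity encoder; the encoder, the use of cosine similarity, the averaging over references, and the function names `cosine` and `multi_id_alignment_score` are all hypothetical choices made here. The optimal one-to-one matching is the assignment problem that the Hungarian algorithm solves in polynomial time; for the handful of subjects typical in one image, this sketch finds the same optimum by brute force over permutations to stay dependency-free.

```python
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def multi_id_alignment_score(ref_embs, gen_embs):
    """Score how well generated subjects match reference subjects.

    Assumption (not from the paper): the score is the mean similarity of the
    best one-to-one matching; a reference with no matched detection scores 0.
    Returns (score, matching), where matching is a list of
    (reference_index, generated_index or None).
    """
    n_ref = len(ref_embs)
    if n_ref == 0:
        return 0.0, []

    # Pairwise similarity matrix: rows are references, columns are detections.
    sim = [[cosine(r, g) for g in gen_embs] for r in ref_embs]

    # Pad with zero-similarity dummy columns so every reference gets a slot
    # even when fewer subjects were detected than requested.
    n_cols = max(n_ref, len(gen_embs))
    for row in sim:
        row.extend([0.0] * (n_cols - len(row)))

    # Exhaustive search over one-to-one assignments (same optimum as Hungarian).
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n_cols), n_ref):
        total = sum(sim[i][perm[i]] for i in range(n_ref))
        if total > best_total:
            best_total, best_perm = total, perm

    matching = [(i, j if j < len(gen_embs) else None) for i, j in enumerate(best_perm)]
    return best_total / n_ref, matching


if __name__ == "__main__":
    refs = [[1.0, 0.0], [0.0, 1.0]]   # two reference identities (toy embeddings)
    gens = [[0.0, 1.0], [1.0, 0.0]]   # same identities, detected in swapped order
    print(multi_id_alignment_score(refs, gens))
```

Matching before scoring makes the result independent of the order in which subjects are detected, and a missing subject lowers the score instead of being silently ignored. For larger subject counts, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would replace the permutation loop.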
