MultiCrafter: High-Fidelity Multi-Subject Generation via Disentangled Attention and Identity-Aware Preference Alignment

Tao Wu, Yibo Jiang, Yehao Lu, Zhizhong Wang, Zeyi Huang, Zequn Qin, Xi Li

TL;DR

MultiCrafter tackles the challenge of high-fidelity multi-subject image generation aligned with human preferences by decoupling the learning process into fidelity-focused pre-training and preference-driven post-training. It introduces Identity-Disentangled Attention Regularization (IDAR) with MoE-LoRA to suppress attention leakage and distinguish subject regions, followed by Identity-Preserving Preference Optimization (IPPO) that uses a stable Group Sequence Policy Optimization (GSPO) objective and a Hungarian-based Multi-ID Alignment Reward to optimize aesthetics, text alignment, and identity fidelity. The two-stage framework, backed by a large, carefully constructed dataset and an online RL setup, achieves state-of-the-art results in subject fidelity while maintaining strong alignment with human preferences, across both multi-human and multi-object generation. This approach provides a practical pathway to reliable, high-quality personalized generation with scalable attention control and robust evaluation of multi-subject fidelity.
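To make the GSPO objective mentioned above more concrete, here is a minimal sketch of its general form: group-normalized advantages combined with a clipped, length-normalized sequence-level importance ratio. This summary does not state how the paper applies GSPO to denoising trajectories, so the function name `gspo_surrogate`, the per-step log-probability inputs, and the `clip_eps` value are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def gspo_surrogate(logp_new, logp_old, rewards, clip_eps=0.2):
    """Clipped sequence-level surrogate over one group of samples.

    logp_new, logp_old: lists of 1-D arrays holding per-step log-probabilities
        of each sample under the current and the sampling policy (assumed inputs).
    rewards: one scalar reward per sample in the group.
    clip_eps: clipping range; 0.2 is a placeholder, not a value from the paper.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Sequence-level ratio: geometric mean of the per-step probability ratios.
    ratios = np.array([
        np.exp(np.mean(np.asarray(n) - np.asarray(o)))
        for n, o in zip(logp_new, logp_old)
    ])
    clipped = np.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic (min) surrogate, averaged over the group; to be maximized.
    return float(np.mean(np.minimum(ratios * adv, clipped * adv)))
```

Because the ratio is computed once per sample rather than once per step, a single noisy step cannot dominate the update, which is the usual argument for the stability of sequence-level clipping.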

Abstract

Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. Existing In-Context-Learning-based methods are limited by their highly coupled training paradigm. These methods attempt to achieve both high subject fidelity and multi-dimensional human preference alignment within a single training stage, relying on a single, indirect reconstruction loss, which makes it difficult to satisfy both goals simultaneously. To address this, we propose MultiCrafter, a framework that decouples this task into two distinct training stages. First, in a pre-training stage, we introduce an explicit positional supervision mechanism that effectively resolves attention bleeding and drastically enhances subject fidelity. Second, in a post-training stage, we propose Identity-Preserving Preference Optimization, a novel online reinforcement learning framework. It features a scoring mechanism, based on the Hungarian matching algorithm, that accurately assesses multi-subject fidelity, allowing the model to optimize for aesthetics and prompt alignment while preserving the subject fidelity achieved in the first stage. Experiments validate that our decoupling framework significantly improves subject fidelity while aligning better with human preferences.
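The Hungarian-matching-based fidelity score can be illustrated with a short sketch. It assumes each reference subject and each subject detected in the generated image has already been reduced to an identity embedding; the function name `multi_id_alignment_reward`, the cosine-similarity measure, and the normalization by the number of references are hypothetical choices made for illustration, not details taken from the paper. The matching itself uses SciPy's `linear_sum_assignment`, which solves the same assignment problem the Hungarian algorithm addresses.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def multi_id_alignment_reward(ref_embeds, gen_embeds):
    """Mean cosine similarity under the best one-to-one subject matching.

    ref_embeds: (N, D) array, one identity embedding per reference subject.
    gen_embeds: (M, D) array, one embedding per subject found in the output.
    Both inputs are assumed to come from an external identity encoder.
    """
    ref = np.asarray(ref_embeds, dtype=float)
    gen = np.asarray(gen_embeds, dtype=float)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    gen = gen / np.linalg.norm(gen, axis=1, keepdims=True)
    sim = ref @ gen.T  # sim[i, j]: reference i vs. generated subject j
    # Optimal assignment that maximizes the total similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    # Dividing by the number of references penalizes missing subjects.
    reward = sim[rows, cols].sum() / len(ref)
    return float(reward), list(zip(rows.tolist(), cols.tolist()))

# Two references whose counterparts appear in swapped order in the output.
refs = np.array([[1.0, 0.0], [0.0, 1.0]])
gens = np.array([[0.0, 2.0], [3.0, 0.0]])
reward, pairs = multi_id_alignment_reward(refs, gens)
```

Matching before scoring means the reward does not depend on the order in which subjects are detected, so swapped positions are not mistaken for identity errors, while a subject that is missing or blended with another still lowers the score.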

Paper Structure

This paper contains 24 sections, 12 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Data processing pipeline for customized multi-human image generation.
  • Figure 2: Visual comparison of attention maps. The ICL-based method UNO (left) fails to preserve subject fidelity because of attention bleeding: the attention regions of the double block for each subject are entangled, which leads to attribute leakage. Our method overcomes this problem and maintains subject fidelity.
  • Figure 2: Aesthetic Score Distribution of Training Data. The histogram illustrates the frequency of HPS v2 scores within our training dataset. The distribution is centered around a mean of 0.2552.
  • Figure 3: Overall pipeline of MultiCrafter. Our framework is built on two core innovations: (Top) Identity-Disentangled Attention Regularization uses positional supervision to prevent attribute leakage and the MoE-LoRA architecture to boost model capacity for diverse scenarios; and (Bottom) the Identity-Preserving Preference Alignment framework employs a novel online reinforcement learning strategy with a Multi-ID Alignment Reward and the stable GSPO algorithm to align the model with human preferences.
  • Figure 3: Visualization for part of our multi-human evaluation benchmarks.
  • ...and 11 more figures