IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

Honghao Cai, Xiangyuan Wang, Yunhao Bai, Tianze Zhou, Sijie Xu, Yuyang Hao, Zezhou Cui, Yuyuan Yang, Wei Zhu, Yibo Chen, Xu Tang, Yao Hu, Zhen Li

TL;DR

Extensive experiments on two challenging benchmarks demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.

Abstract

Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks -- direct multi-person fusion and age-transformed group generation -- demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.

Paper Structure (27 sections, 9 equations, 4 figures, 3 tables, 1 algorithm)

Figures (4)

  • Figure 1: Qualitative results of IdGlow on two multi-subject generation tasks. Given a set of reference portrait images (left), IdGlow generates high-fidelity group photos (right) that faithfully preserve each individual's identity while producing coherent, aesthetically pleasing scenes. Top: Task 2 (age-transformed group generation), where adult identities are transformed into child-like appearances while maintaining discriminative facial features. Bottom: Task 1 (direct group fusion).
  • Figure 2: The architecture of IdGlow-DiT. The model processes variable numbers of reference identities through a unified encoding strategy, forming a concatenated multi-ID sequence. A key innovation is the Dynamics-Aware Gating Module (highlighted in orange), which modulates the intensity of the identity sequence based on the diffusion timestep $t$ and the specific task (e.g., age transformation curves). These gated features are injected into the main generation process via Dual-Stream DiT Blocks, where they serve as keys/values in a specialized Multi-ID Cross-Attention layer, interacting with the noisy latent queries.
  • Figure 3: Task-specific prompt synthesis via the Image-Edit-Prompt model. Given a set of input images and a structured VLM prompt with spatial instructions, our fine-tuned Qwen 3 VL model automatically generates a detailed, spatially precise prompt that specifies subject positions, appearance attributes, and scene composition. This generated prompt then guides the diffusion model to produce a coherent group photo with correct spatial arrangement and identity preservation.
  • Figure 4: Dynamics-aware identity modulation tailored to specific generative tasks. We design dynamic loss schedules aligned with the spectral evolution of the diffusion process ($t$: $1.0 \to 0.0$). (a) Loss Annealing for Group Generation: high identity weight in early stages to establish identity foundation, gradually relaxed for harmonious lighting and pose. (b) Temporal Gating for Age Transformation: identity constraints selectively activated only during the semantic window ($t \in [0.3, 0.6]$), suppressed at early stages for child-like structure and reduced at late stages for smooth skin texture.
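
The two schedules described in Figure 4 can be sketched as simple functions of the timestep $t$, with $t = 1.0$ the earliest (noisiest) step and $t = 0.0$ the final step. This is a minimal illustration, not the paper's implementation: only the semantic window $[0.3, 0.6]$ and the qualitative shape of each curve come from the caption. The function names and the weight values (`w_max`, `w_min`, `w_in`, `w_early`, `w_late`) are hypothetical placeholders.

```python
def linear_decay_weight(t: float, w_max: float = 1.0, w_min: float = 0.1) -> float:
    """Identity-loss weight for group generation (Figure 4a).

    High at early steps (t near 1.0) and linearly relaxed as t -> 0.0.
    w_max and w_min are assumed values, not taken from the paper.
    """
    t = min(max(t, 0.0), 1.0)  # clamp t to [0, 1]
    return w_min + (w_max - w_min) * t


def temporal_gate_weight(
    t: float,
    window: tuple = (0.3, 0.6),  # semantic window stated in the caption
    w_in: float = 1.0,           # assumed weight inside the window
    w_early: float = 0.0,        # assumed: suppressed at early steps (t > 0.6)
    w_late: float = 0.2,         # assumed: reduced at late steps (t < 0.3)
) -> float:
    """Identity-loss weight for age transformation (Figure 4b)."""
    lo, hi = window
    if t > hi:
        return w_early  # early stage: leave room for child-like structure
    if t < lo:
        return w_late   # late stage: relax for smooth skin texture
    return w_in         # inside the window: identity constraint active


if __name__ == "__main__":
    for t in (1.0, 0.8, 0.6, 0.45, 0.3, 0.1, 0.0):
        print(f"t={t:.2f}  decay={linear_decay_weight(t):.3f}  "
              f"gate={temporal_gate_weight(t):.3f}")
```

In a training loop, a weight of this kind would typically multiply the identity term of the loss at the sampled timestep. The hard step edges of the gate are a simplification, and the actual curves may be smoother.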