
When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization

Zhihan Chen, Yuhuan Zhao, Yijie Zhu, Xinyu Yao

Abstract

Subject-driven text-to-image diffusion models have achieved remarkable success in preserving single identities, yet their ability to compose multiple interacting subjects remains largely unexplored and highly challenging. Existing evaluation protocols typically rely on global CLIP metrics, which are insensitive to local identity collapse and fail to capture the severity of multi-subject entanglement. In this paper, we identify a pervasive "Illusion of Scalability" in current models: while they excel at synthesizing 2-4 subjects in simple layouts, they suffer from catastrophic identity collapse when scaled to 6-10 subjects or tasked with complex physical interactions. To systematically expose this failure mode, we construct a rigorous stress-test benchmark comprising 75 prompts distributed across varying subject counts and interaction difficulties (Neutral, Occlusion, Interaction). Furthermore, we demonstrate that standard CLIP-based metrics are fundamentally flawed for this task, as they often assign high scores to semantically correct but identity-collapsed images (e.g., generating generic clones). To address this, we introduce the Subject Collapse Rate (SCR), a novel evaluation metric grounded in DINOv2's structural priors, which strictly penalizes local attention leakage and homogenization. Our extensive evaluation of state-of-the-art models (MOSAIC, XVerse, PSR) reveals a precipitous drop in identity fidelity as scene complexity grows, with SCR approaching 100% at 10 subjects. We trace this collapse to the semantic shortcuts inherent in global attention routing, underscoring the urgent need for explicit physical disentanglement in future generative architectures.

Paper Structure

This paper contains 19 sections, 2 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Catastrophic Identity Collapse in Multi-Subject Personalization. (Left) Our quantitative analysis reveals the illusion of scalability: as the number of interacting subjects ($N$) increases, the fine-grained identity fidelity (DINOv2) plummets, and the Subject Collapse Rate (SCR) skyrockets to $>95\%$. (Middle) At $N=2$, state-of-the-art models (e.g., MOSAIC mosaic2024) successfully maintain distinct identities and structural integrity. (Right) When evaluated on our proposed stress-test benchmark at $N=8$, models experience severe attention leakage, resulting in identity bleeding and homogenization (generating clones of a single dominant identity).
  • Figure 2: Multi-Subject Benchmark Construction. Our pipeline samples identities from a unified subject pool (left) to populate prompts of increasing complexity (2 to 10 subjects). Prompts are systematically categorized into Neutral, Occlusion, and Interaction scenarios (right) to isolate and test specific failure modes in attention routing and geometric reasoning.
  • Figure 3: Subject Collapse Rate (SCR). Unlike average similarity scores which mask individual failures, SCR explicitly counts the proportion of subjects whose DINOv2 identity similarity falls below a strict threshold $\tau$. This provides a more realistic measure of multi-subject entanglement.
  • Figure 4: Performance Trends across Subject Counts. (Left) DINOv2 identity similarity exhibits a precipitous drop for all models as scenes become denser. (Right) Subject Collapse Rate (SCR@0.4) skyrockets from $\sim$50% at 2 subjects to nearly 100% at 8-10 subjects, highlighting a fundamental scalability bottleneck.
  • Figure 5: Comprehensive Metric Radar (2 Subjects). A multi-dimensional comparison of MOSAIC, XVerse, and PSR. While all models maintain high semantic alignment (CLIP-T), MOSAIC shows a distinct advantage in structural (DINO) and fine-grained identity (DINOv2) preservation.
  • ...and 2 more figures
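The Subject Collapse Rate described in Figure 3 reduces to a simple computation: given one DINOv2 identity similarity per subject (generated crop vs. reference image), count the fraction that fall below the threshold $\tau$. The sketch below illustrates this definition; the function name and the example similarity values are illustrative, and the threshold $\tau = 0.4$ follows the SCR@0.4 setting reported in Figure 4. How subject crops are localized and matched to references is not specified here and would follow the paper's evaluation protocol.

```python
import numpy as np

def subject_collapse_rate(similarities, tau=0.4):
    """Fraction of subjects whose identity similarity falls below tau.

    `similarities` holds one DINOv2 cosine similarity per subject,
    comparing the generated subject crop against its reference image.
    Unlike an average similarity, which a few well-preserved subjects
    can inflate, SCR counts each collapsed subject individually.
    """
    sims = np.asarray(similarities, dtype=float)
    return float(np.mean(sims < tau))

# Illustrative 8-subject scene where most identities have collapsed:
sims = [0.62, 0.35, 0.28, 0.31, 0.55, 0.22, 0.18, 0.39]
print(subject_collapse_rate(sims))  # 6 of 8 below 0.4 -> 0.75
```

Note that averaging these similarities gives 0.3625, a middling score that hides the fact that six of eight subjects fell below the fidelity threshold; SCR makes that failure explicit.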