DisCo: Reinforcement with Diversity Constraints for Multi-Human Generation

Shubhankar Borse, Farzad Farhadzadeh, Munawar Hayat, Fatih Porikli

TL;DR

DisCo addresses the identity crisis in multi-human text-to-image generation by directly optimizing identity diversity through reinforcement learning. It finetunes flow-matching models with GRPO using a compositional reward that enforces intra-image diversity $r_{\text{img}}^{d}$, group-wise diversity $r_{\text{grp}}^{d}$, count accuracy $r_{\text{img}}^{c}$, and quality $r_{\text{img}}^{q}$, along with a single-stage curriculum to scale from 2 to $N_{\text{max}}$ people. The method achieves state-of-the-art identity diversity and accurate person counts across DiverseHumans and MultiHuman-TestBench, while maintaining perceptual quality, and shows strong generalization across base models (Flux and Krea) and toward proprietary baselines. By providing an annotation-free, scalable solution, DisCo closes the long-standing gap in differentiating identities in multi-human generation and enables reliable synthetic data and creative applications. The work advances practical deployment by balancing generation fidelity with robust identity spread and offers avenues for extending to videos, other identities, and fairness-focused analyses.

Abstract

State-of-the-art text-to-image models excel at realism but collapse on multi-human prompts - duplicating faces, merging identities, and miscounting individuals. We introduce DisCo (Reinforcement with Diversity Constraints), the first RL-based framework to directly optimize identity diversity in multi-human generation. DisCo fine-tunes flow-matching models via Group-Relative Policy Optimization (GRPO) with a compositional reward that (i) penalizes intra-image facial similarity, (ii) discourages cross-sample identity repetition, (iii) enforces accurate person counts, and (iv) preserves visual fidelity through human preference scores. A single-stage curriculum stabilizes training as complexity scales, requiring no extra annotations. On the DiverseHumans Testset, DisCo achieves 98.6 Unique Face Accuracy and near-perfect Global Identity Spread - surpassing both open-source and proprietary methods (e.g., Gemini, GPT-Image) while maintaining competitive perceptual quality. Our results establish DisCo as a scalable, annotation-free solution that resolves the long-standing identity crisis in generative models and sets a new benchmark for compositional multi-human generation.
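
To make the four reward components and the group-relative update concrete, the following is a minimal sketch, not the authors' implementation. It assumes face embeddings, detected person counts, and preference scores are already available from external models. The function names, the cosine-similarity formulation, the binary count reward, and the equal weights are all illustrative assumptions; the paper's exact reward definitions may differ.

```python
import numpy as np

def _unit(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row (small epsilon avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

def intra_image_diversity(faces: np.ndarray) -> float:
    """Assumed form: 1 minus the highest pairwise cosine similarity in one image."""
    n = faces.shape[0]
    if n < 2:
        return 1.0  # a single face cannot duplicate another
    sim = _unit(faces) @ _unit(faces).T
    off_diag = sim[~np.eye(n, dtype=bool)]
    return float(1.0 - np.clip(off_diag.max(), 0.0, 1.0))

def group_diversity(group: list, i: int) -> float:
    """Assumed form: 1 minus the highest similarity between image i's faces
    and any face in the other images sampled for the same prompt."""
    own = group[i]
    others = [e for j, e in enumerate(group) if j != i and len(e) > 0]
    if len(own) == 0 or not others:
        return 1.0
    sim = _unit(own) @ _unit(np.concatenate(others)).T
    return float(1.0 - np.clip(sim.max(), 0.0, 1.0))

def composite_rewards(group, counts, target, quality, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four components; the weights are hypothetical."""
    w_d, w_g, w_c, w_q = weights
    rewards = []
    for i, faces in enumerate(group):
        r_img_d = intra_image_diversity(faces)
        r_grp_d = group_diversity(group, i)
        r_img_c = 1.0 if counts[i] == target else 0.0  # assumed binary count reward
        r_img_q = quality[i]  # e.g. a preference score from an external scorer
        rewards.append(w_d * r_img_d + w_g * r_grp_d + w_c * r_img_c + w_q * r_img_q)
    return np.array(rewards)

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: standardize rewards within the sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy group of two images for a two-person prompt (3-D stand-in embeddings).
image_a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # two distinct faces
image_b = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])  # the same face twice
rewards = composite_rewards([image_a, image_b], counts=[2, 2], target=2, quality=[0.5, 0.5])
advantages = group_relative_advantages(rewards)
```

In this toy group, the image with a duplicated face receives the lower composite reward and therefore a negative advantage, which is the signal a policy-gradient update would use to move the model away from identity duplication.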

Paper Structure

This paper contains 65 sections, 19 equations, 9 figures, 6 tables, 3 algorithms.

Figures (9)

  • Figure 1: DisCo enables identity-consistent multi-human generation. (a) SOTA methods often produce duplicate or inconsistent faces, while (b) DisCo generates distinct, diverse identities. (c) Quantitative results show clear gains in Count Accuracy, Unique Face Accuracy, Identity Spread, and overall quality (HPSv2 score).
  • Figure 2: The Identity Crisis. Observe the images carefully, which have been generated by the recent SOTA text-to-image methods. From an initial glance, they look great. However, can you spot the issue?
  • Figure 3: DisCo training overview. Our method fine-tunes text-to-image models using Flow-GRPO with a compositional reward. Given a prompt, the model generates a group of images evaluated by four components: (1) Intra-Image Diversity penalizes duplicate identities within images, (2) Group-wise Diversity promotes variation across the group, (3) Count Accuracy enforces correct person count, and (4) HPS Quality ensures prompt alignment and visual fidelity. The combined reward guides GRPO updates to improve identity consistency and diversity.
  • Figure 4: Performance vs. number of people. We evaluate (a) Unique Face Accuracy, (b) Count Accuracy, and (c) HPSv2 across varying face counts. Error bars show 95% confidence intervals. DisCo (Flux), shown in green, consistently performs well across all metrics, maintaining high accuracy as face count increases.
  • Figure 5: DisCo vs. Related Work. DisCo fine-tuning improves performance over current SOTA methods, consistently generating the correct number of people without overlapping identities. It also maintains high perceptual quality while accurately following input prompts.
  • ...and 4 more figures