DisCo: Reinforcement with Diversity Constraints for Multi-Human Generation
Shubhankar Borse, Farzad Farhadzadeh, Munawar Hayat, Fatih Porikli
TL;DR
DisCo addresses the identity crisis in multi-human text-to-image generation by directly optimizing identity diversity through reinforcement learning. It finetunes flow-matching models with GRPO using a compositional reward that enforces intra-image diversity $r_{\text{img}}^{d}$, group-wise diversity $r_{\text{grp}}^{d}$, count accuracy $r_{\text{img}}^{c}$, and quality $r_{\text{img}}^{q}$, along with a single-stage curriculum to scale from 2 to $N_{\text{max}}$ people. The method achieves state-of-the-art identity diversity and accurate person counts across DiverseHumans and MultiHuman-TestBench, while maintaining perceptual quality, and shows strong generalization across base models (Flux and Krea) and toward proprietary baselines. By providing an annotation-free, scalable solution, DisCo closes the long-standing gap in differentiating identities in multi-human generation and enables reliable synthetic data and creative applications. The work advances practical deployment by balancing generation fidelity with robust identity spread and offers avenues for extending to videos, other identities, and fairness-focused analyses.
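One way to read the optimization target, as a sketch only: assume the four reward terms are combined as a weighted sum for each image $i$ in a group of $G$ images sampled from the same prompt, and that GRPO then standardizes the rewards within that group. The weights $\lambda_1, \dots, \lambda_4$ and the index $i$ are hypothetical symbols introduced here for illustration; the summary above does not state how the terms are weighted or combined.

$$
r_i = \lambda_1\, r_{\text{img},i}^{d} + \lambda_2\, r_{\text{grp},i}^{d} + \lambda_3\, r_{\text{img},i}^{c} + \lambda_4\, r_{\text{img},i}^{q},
\qquad
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
$$

Under this reading, an image is reinforced only to the extent that it scores higher than the other images sampled for the same prompt, which is the group-relative part of GRPO.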
Abstract
State-of-the-art text-to-image models excel at realism but collapse on multi-human prompts: duplicating faces, merging identities, and miscounting individuals. We introduce DisCo (Reinforcement with Diversity Constraints), the first RL-based framework to directly optimize identity diversity in multi-human generation. DisCo fine-tunes flow-matching models via Group-Relative Policy Optimization (GRPO) with a compositional reward that (i) penalizes intra-image facial similarity, (ii) discourages cross-sample identity repetition, (iii) enforces accurate person counts, and (iv) preserves visual fidelity through human preference scores. A single-stage curriculum stabilizes training as complexity scales, requiring no extra annotations. On the DiverseHumans Testset, DisCo achieves 98.6 Unique Face Accuracy and near-perfect Global Identity Spread, surpassing both open-source and proprietary methods (e.g., Gemini, GPT-Image) while maintaining competitive perceptual quality. Our results establish DisCo as a scalable, annotation-free solution that resolves the long-standing identity crisis in generative models and sets a new benchmark for compositional multi-human generation.
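To make the four reward components (i) to (iv) concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each detected face has already been mapped to an embedding vector by some face recognition model (the abstract does not name one), that similarity is measured by cosine similarity, and that the terms are combined with a weighted sum. All function names, the clipping of similarities at zero, the exact-match count reward, and the default weights are hypothetical choices made for illustration.

```python
import numpy as np


def _unit(x):
    """L2-normalize embeddings along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def intra_image_diversity(faces):
    """(i) 1 minus the highest pairwise cosine similarity within one image.

    Assumption: with fewer than two faces there is nothing to compare,
    so the term is set to 1.0 and the count reward handles the error.
    """
    f = _unit(faces)
    if len(f) < 2:
        return 1.0
    sim = f @ f.T
    upper = np.triu_indices(len(f), k=1)  # each unordered pair once
    return float(1.0 - max(0.0, sim[upper].max()))


def cross_sample_diversity(faces, other_faces):
    """(ii) 1 minus the highest cosine similarity between this image's
    faces and the faces found in the other images of the same group."""
    if len(faces) == 0 or len(other_faces) == 0:
        return 1.0
    sim = _unit(faces) @ _unit(other_faces).T
    return float(1.0 - max(0.0, sim.max()))


def count_reward(n_detected, n_target):
    """(iii) 1.0 when the detected person count matches the prompt."""
    return 1.0 if n_detected == n_target else 0.0


def composite_reward(faces, other_faces, n_target, quality,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; `quality` stands in for (iv),
    a human preference score computed elsewhere."""
    terms = (
        intra_image_diversity(faces),
        cross_sample_diversity(faces, other_faces),
        count_reward(len(faces), n_target),
        float(quality),
    )
    return float(sum(w * t for w, t in zip(weights, terms)))


if __name__ == "__main__":
    duplicated = np.array([[1.0, 0.0], [1.0, 0.0]])  # same identity twice
    distinct = np.array([[1.0, 0.0], [0.0, 1.0]])    # two different identities
    others = np.array([[-1.0, 0.0]])                 # face from another sample
    print(composite_reward(duplicated, others, n_target=2, quality=0.5))
    print(composite_reward(distinct, others, n_target=2, quality=0.5))
```

In this toy setup the image with two distinct identities receives a higher reward than the one with a duplicated face, which is the behavior the diversity terms are meant to encourage. The per-image rewards would then feed GRPO's group-relative normalization.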
