Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation

Shubhankar Borse, Phuc Pham, Farzad Farhadzadeh, Seokeon Choi, Phong Ha Nguyen, Anh Tuan Tran, Sungrack Yun, Munawar Hayat, Fatih Porikli

TL;DR

Ar2Can addresses unreliable multi-human generation by decoupling spatial layout generation from identity-preserving rendering. It introduces two Architect variants for layout and a GRPO-trained Artist with compositional rewards, including Hungarian centroid-based face matching and ArcFace identity similarity, and achieves strong count accuracy and identity preservation on MultiHuman-Testbench while training primarily on synthetic data. The modular architecture supports trade-offs between speed and accuracy and reports state-of-the-art performance across key metrics, with token sharing and a curriculum contributing to efficiency and stability. This framework paves the way for scalable, controllable multi-human generation in real-world applications.

Abstract

Despite recent advances in text-to-image generation, existing models consistently fail to produce reliable multi-human scenes, often duplicating faces, merging identities, or miscounting individuals. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect module predicts structured layouts, specifying where each person should appear. The Artist module then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with ArcFace identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities. We develop two Architect variants, seamlessly integrated with our diffusion-based Artist model and optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images.
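The spatially-grounded face matching reward described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the use of plain lists as stand-ins for ArcFace embeddings, the clamping of negative similarities, and the choice to score unmatched layout slots as zero are all assumptions made for illustration. Because scenes contain only a handful of people, the sketch finds the minimum-cost centroid assignment by exhaustive search, which yields the same optimal matching that the Hungarian algorithm would return.

```python
import itertools
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def match_faces(layout_centroids, detected_centroids):
    """Pair layout slots with detected faces, minimizing total centroid distance.

    Exhaustive search over assignments; a stand-in for the Hungarian algorithm
    that is only practical for small person counts.
    """
    n, m = len(layout_centroids), len(detected_centroids)
    k = min(n, m)
    best_cost, best_pairs = math.inf, []
    for rows in itertools.combinations(range(n), k):
        for cols in itertools.permutations(range(m), k):
            cost = sum(
                math.dist(layout_centroids[r], detected_centroids[c])
                for r, c in zip(rows, cols)
            )
            if cost < best_cost:
                best_cost, best_pairs = cost, list(zip(rows, cols))
    return best_pairs


def face_matching_reward(layout_centroids, ref_embeddings,
                         detected_centroids, detected_embeddings):
    """Mean identity similarity over layout slots after spatial matching.

    Assumption: unmatched slots contribute zero and negative similarities
    are clamped to zero.
    """
    if not layout_centroids:
        return 0.0
    pairs = match_faces(layout_centroids, detected_centroids)
    total = sum(
        max(0.0, cosine(ref_embeddings[i], detected_embeddings[j]))
        for i, j in pairs
    )
    return total / len(layout_centroids)


# Toy example: two people, detections listed in the opposite order.
layout = [(0.25, 0.5), (0.75, 0.5)]
refs = [[1.0, 0.0], [0.0, 1.0]]
detected = [(0.8, 0.5), (0.2, 0.5)]
right_ids = [[0.0, 1.0], [1.0, 0.0]]   # each identity at its planned location
swapped_ids = [[1.0, 0.0], [0.0, 1.0]]  # identities rendered in swapped positions

reward_right = face_matching_reward(layout, refs, detected, right_ids)
reward_swapped = face_matching_reward(layout, refs, detected, swapped_ids)
```

Matching on position first and only then comparing identities means an image that renders the right faces in the wrong places scores low, which is the behavior the abstract attributes to the reward.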

Paper Structure

This paper contains 49 sections, 9 equations, 20 figures, 7 tables, 5 algorithms.

Figures (20)

  • Figure 1: Ar2Can Framework Overview. Our two-stage approach decomposes multi-human generation into spatial planning (Architect) and identity-preserving rendering (Artist). The Architect generates bounding boxes specifying where each person should appear. The Artist renders photorealistic outputs, ensuring faces appear at correct locations while preserving identities.
  • Figure 2: Ar2Can generates highly photorealistic multi-human scenes with 1-5 people while preserving the individual identities. Our two-stage architecture produces natural poses, realistic lighting, and proper spatial arrangements without identity merging or blending artifacts. Please see Appendix E for the respective input images.
  • Figure 3: Architecture of the LLM-based Architect-A for layout. Top: response example for spatial layout generation. Bottom: our lightweight LLM extended with special tokens. When the token head predicts <C>, the coordinate head is triggered to regress coordinate values $b_{\text{pred}}$, which are then re-embedded through a coordinate embedding head to form instance-specific tokens $\texttt{<C}_\texttt{i}\texttt{>}$.
  • Figure 4: Artist training pipeline with GRPO. Given a layout from the Architect, reference images, and text prompt, we generate multiple samples and optimize using compositional rewards: count accuracy, prompt alignment/aesthetic quality (HPSv3), spatially-grounded face matching via Hungarian correspondence and pose correction.
  • Figure 5: We drop tokens from canvas patches which don't contain any input information. Additionally, tokens in overlapping regions receive identical positional encodings, enabling the model to learn natural occlusion and depth ordering.
  • ...and 15 more figures
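
The GRPO step summarized in the Figure 4 caption (sample several images for one prompt, score each with compositional rewards, and optimize relative to the group) can be sketched as follows. This is a minimal illustration of the standard group-relative advantage, not the paper's training code: the equal reward weights, the three-component reward, and the function names are assumptions.

```python
import statistics


def composite_reward(count_r, quality_r, identity_r, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of reward components (weights are assumed, not from the paper)."""
    w_count, w_quality, w_identity = weights
    return w_count * count_r + w_quality * quality_r + w_identity * identity_r


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward by the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group of three samples generated for the same prompt and layout.
group = [
    composite_reward(count_r=1.0, quality_r=0.0, identity_r=0.0),
    composite_reward(count_r=1.0, quality_r=1.0, identity_r=0.0),
    composite_reward(count_r=1.0, quality_r=1.0, identity_r=1.0),
]
advantages = group_relative_advantages(group)
```

Samples that score above their group's mean receive positive advantages and are reinforced, while those below are discouraged; when every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.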