MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans

Shubhankar Borse, Seokeon Choi, Sunghyun Park, Jeongho Kim, Shreya Kadambi, Risheek Garrepalli, Sungrack Yun, Munawar Hayat, Fatih Porikli

TL;DR

MultiHuman-Testbench addresses the difficulty of generating scenes with multiple distinct humans while preserving their identities. It is a dedicated benchmark with 1,800 prompts and 5,550 reference faces, covering scenes with 1–5 people. It introduces a four-metric evaluation framework and four tasks that assess count accuracy, ID similarity, prompt alignment, and action correctness, augmented by pose conditioning and multi-view data. The authors also propose two training-free enhancements, Unified Regional Isolation and Implicit Assignment, that plug into unified multi-modal architectures to reduce identity leakage and improve region-wise identity control, yielding the MH-OmniGen and MH-IR-Diffusion variants. Benchmarking roughly 30 methods across the four tasks reveals substantial gaps in accurate counting and identity preservation and uncovers demographic biases across several attributes, underscoring the need for further methodological advances and ethical considerations.

Abstract

Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. A major factor contributing to this is the lack of a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1,800 samples, including carefully curated text prompts, describing a range of simple to complex human actions. These prompts are matched with a total of 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. Alongside captions, we provide human-selected pose conditioning images which accurately match the prompt. We propose a multi-faceted evaluation suite employing four key metrics to quantify face count, ID similarity, prompt alignment, and action detection. We conduct a thorough evaluation of a diverse set of models, including zero-shot approaches and training-based methods, with and without regional priors. We also propose novel techniques to incorporate image and region isolation using human segmentation and Hungarian matching, significantly improving ID similarity. Our proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation. The dataset and evaluation codes will be available at https://github.com/Qualcomm-AI-research/MultiHuman-Testbench.
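
The abstract mentions Hungarian matching but does not spell out the procedure here. The following is a minimal sketch, not the authors' implementation, of how reference faces might be assigned one-to-one to generated faces before an ID similarity score is computed. It assumes that each face is already represented by an embedding vector, that similarity is cosine similarity, and that unmatched reference faces contribute zero to the mean; the function names and toy vectors are hypothetical. Because the benchmark involves at most five people per image, the sketch substitutes an exhaustive search over permutations for the Hungarian algorithm; for such small counts both find an assignment with the same optimal total similarity.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_identities(ref_embs, gen_embs):
    """Assign reference faces to distinct generated faces, maximizing total similarity.

    Returns (pairs, mean_similarity). `pairs` is a list of
    (reference_index, generated_index) tuples. The mean is taken over all
    reference faces, so a missing generated face lowers the score
    (an assumption of this sketch, not a documented rule of the benchmark).
    """
    n_ref, n_gen = len(ref_embs), len(gen_embs)
    if n_ref == 0 or n_gen == 0:
        return [], 0.0

    # sim[i][j] = similarity between reference i and generated face j.
    sim = [[cosine(r, g) for g in gen_embs] for r in ref_embs]

    best_pairs, best_total = [], float("-inf")
    if n_ref <= n_gen:
        # Every reference gets a distinct generated face.
        for perm in permutations(range(n_gen), n_ref):
            total = sum(sim[i][j] for i, j in enumerate(perm))
            if total > best_total:
                best_total, best_pairs = total, list(enumerate(perm))
    else:
        # Fewer generated faces than references: every generated face
        # gets a distinct reference, and the remaining references stay unmatched.
        for perm in permutations(range(n_ref), n_gen):
            total = sum(sim[i][j] for j, i in enumerate(perm))
            if total > best_total:
                best_total = total
                best_pairs = sorted((i, j) for j, i in enumerate(perm))
    return best_pairs, best_total / n_ref


if __name__ == "__main__":
    # Toy 3-dimensional embeddings; real face embeddings are much larger.
    refs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    gens = [[0.0, 0.9, 0.1], [0.8, 0.2, 0.0]]
    print(match_identities(refs, gens))
```

Matching before scoring matters because the generated faces appear in no particular order: comparing reference i with generated face i directly would penalize an image that contains every identity but in a different spatial arrangement. For larger numbers of faces, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the more appropriate choice.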

Paper Structure

This paper contains 27 sections, 5 equations, 9 figures, 9 tables, and 3 algorithms.

Figures (9)

  • Figure 1: MultiHuman Testbench. Our MultiHuman Testbench consists of 5,550 IDs across 1,800 samples, including captions describing a scene with 1-5 humans.
  • Figure 2: Regional Isolation for Unified Architectures. The updates to the attention mask for regional isolation are shown by the differences between panels (b) and (d).
  • Figure 3: Data distribution across four major attributes: Ethnicity, Age, Gender, and Status. See the appendix for details.
  • Figure 4: Wordcloud. The graphic shows words from our caption space.
  • Figure 5: Qualitative Results on Multi-Human Generation in the wild. The image shows the best-performing methods: UniPortrait, LoRA, GPT-Image-1, OmniGen, and MH-OmniGen.
  • ...and 4 more figures