Multi-Human Mesh Recovery with Transformers

Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy

TL;DR

This paper tackles multi-person monocular human mesh recovery by moving beyond region-based cropping to a whole-image, transformer-based approach. It introduces a streamlined architecture that uses multi-scale features, deformable attention for focused processing, and a novel relative joint supervision to capture inter-person relations. The method achieves state-of-the-art performance on CHI3D, Hi4D, and BEDLAM, with particularly notable gains in joint-level metrics that reflect accurate relative positioning of people in a scene. The work demonstrates the importance of global context and targeted attention in multi-person HMR and suggests promising directions for further improvements such as contact optimization in crowded scenes.

Abstract

Conventional approaches to human mesh recovery predominantly employ a region-based strategy. This involves initially cropping out a human-centered region as a preprocessing step, with subsequent modeling focused on this zoomed-in image. While effective for single figures, this pipeline poses challenges when dealing with images featuring multiple individuals, as different people are processed separately, often leading to inaccuracies in relative positioning. Despite the advantages of adopting a whole-image-based approach to address this limitation, early efforts in this direction have fallen short in performance compared to recent region-based methods. In this work, we advocate for this under-explored area of modeling all people at once, emphasizing its potential for improved accuracy in multi-person scenarios through considering all individuals simultaneously and leveraging the overall context and interactions. We introduce a new model with a streamlined transformer-based design, featuring three critical design choices: multi-scale feature incorporation, focused attention mechanisms, and relative joint supervision. Our proposed model demonstrates a significant performance improvement, surpassing state-of-the-art region-based and whole-image-based methods on various benchmarks involving multiple individuals.
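Of the three design choices named above, relative joint supervision is the one most directly tied to relative positioning. The text here does not give its exact formulation, so the following is only a minimal sketch under an assumption: that the supervision compares offsets between pairs of joints, including pairs from different people, rather than absolute joint positions. The function name `relative_joint_loss`, the L1 penalty, and the nested-list input layout are all hypothetical choices made for illustration, not the authors' implementation.

```python
from itertools import combinations


def relative_joint_loss(pred, target):
    """Mean L1 error between pairwise joint offsets (illustrative assumption).

    pred, target: list of people, each a list of joints, each an (x, y, z) tuple.
    """
    # Flatten so that joints from different people share one index space;
    # cross-person pairs are what tie the loss to relative positioning.
    pred_flat = [joint for person in pred for joint in person]
    target_flat = [joint for person in target for joint in person]
    if len(pred_flat) != len(target_flat):
        raise ValueError("pred and target must contain the same number of joints")

    total, count = 0.0, 0
    for a, b in combinations(range(len(pred_flat)), 2):
        for axis in range(3):
            pred_offset = pred_flat[a][axis] - pred_flat[b][axis]
            target_offset = target_flat[a][axis] - target_flat[b][axis]
            total += abs(pred_offset - target_offset)
            count += 1
    return total / count if count else 0.0


# Two people, two joints each (toy data, not from any benchmark).
target = [[(0, 0, 0), (0, 1, 0)], [(1, 0, 0), (1, 1, 0)]]
# Shifting everyone by the same amount leaves all offsets unchanged.
shifted_all = [[(x + 5, y, z) for x, y, z in person] for person in target]
# Shifting only the second person changes every cross-person offset.
shifted_one = [target[0], [(x + 1, y, z) for x, y, z in target[1]]]

loss_all = relative_joint_loss(shifted_all, target)
loss_one = relative_joint_loss(shifted_one, target)
```

In this sketch, a translation applied to the whole scene costs nothing, while displacing one person relative to another is penalized. A per-person loss computed on separately cropped regions cannot express that distinction.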

Paper Structure

This paper contains 26 sections, 7 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: (Top) The majority of current human mesh recovery techniques adopt a region-based approach. They begin by isolating human-centric regions and processing them individually. This often results in inaccuracies in the relative positioning of multiple individuals. (Bottom) We advocate for the adoption of a whole-image-based approach, where all people are processed simultaneously. This direction is less explored, and early works are outperformed by recent region-based methods by a large margin. In this work, we introduce a new model that incorporates several crucial design choices, showcasing a substantial improvement in accurately modeling all individuals and surpassing the performance of state-of-the-art region-based models.
  • Figure 2: Our proposed approach adopts a streamlined transformer-based design, processing the entire image and generating meshes for all individuals simultaneously. It incorporates three crucial design elements that effectively prioritize essential regions and model the relative positions of humans, ultimately leading to improved performance over existing region-based and whole-image-based methods.
  • Figure 3: Qualitative comparisons on CHI3D (Fieraru et al., 2020; top 2 rows), Hi4D (Yin et al., 2023; middle 2 rows), and BEDLAM (Black et al., 2023; bottom 2 rows). Meshes predicted by our method are overall more accurate in relative locations and orientations than those of top-performing region-based and whole-image-based baselines.