HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery

Yuto Matsubara, Ko Nishino

TL;DR

HeatFormer tackles human mesh recovery from fixed, calibrated multiview cameras by reframing SMPL parameter estimation as neural optimization. It introduces a heatmap-based representation and a Transformer-based HeatEncoder/Decoder that iteratively refines the SMPL parameters $\theta$ (pose) and $\beta$ (shape) from multiple views, without being tied to a particular number or arrangement of views. The authors report state-of-the-art accuracy, robustness to occlusion, and generalization across datasets and view configurations, supported by ablations and cross-dataset tests. The work targets passive human behavior modeling in real-world environments where static cameras can be installed.

Abstract

We introduce a novel method for human shape and pose recovery that can fully leverage multiple static views. We target fixed-multiview people monitoring, including elderly care and safety monitoring, in which calibrated cameras can be installed at the corners of a room or an open space but whose configuration may vary depending on the environment. Our key idea is to formulate it as neural optimization. We achieve this with HeatFormer, a neural optimizer that iteratively refines the SMPL parameters given multiview images, which is fundamentally agnostic to the configuration of views. HeatFormer realizes this SMPL parameter estimation as heatmap generation and alignment with a novel transformer encoder and decoder. We demonstrate the effectiveness of HeatFormer, including its accuracy, robustness to occlusion, and generalizability, through an extensive set of experiments. We believe HeatFormer can play a key role in passive human behavior modeling.
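
The heatmap generation mentioned in the abstract can be pictured, very roughly, as projecting the 3D joints of the current body estimate into each calibrated view and rendering one Gaussian per joint. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the pinhole 3x4 projection matrices, the isotropic Gaussian with `sigma`, and the function names `project_joints`, `render_heatmaps`, and `multiview_heatmaps` are all hypothetical choices made for illustration.

```python
import numpy as np


def project_joints(joints_3d, proj):
    """Project (J, 3) world-space joints to (J, 2) pixels with a 3x4 pinhole matrix."""
    ones = np.ones((joints_3d.shape[0], 1))
    homo = np.concatenate([joints_3d, ones], axis=1)  # (J, 4) homogeneous points
    uvw = homo @ proj.T                               # (J, 3)
    return uvw[:, :2] / uvw[:, 2:3]                   # perspective divide


def render_heatmaps(joints_2d, height, width, sigma=2.0):
    """Render one isotropic Gaussian per joint; returns an array of shape (J, H, W)."""
    ys, xs = np.mgrid[0:height, 0:width]
    dx = xs[None] - joints_2d[:, 0, None, None]       # x offset of every pixel to each joint
    dy = ys[None] - joints_2d[:, 1, None, None]       # y offset of every pixel to each joint
    return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))


def multiview_heatmaps(joints_3d, projections, height, width, sigma=2.0):
    """Stack per-view heatmaps into (V, J, H, W); V is simply the number of matrices given."""
    return np.stack([
        render_heatmaps(project_joints(joints_3d, proj), height, width, sigma)
        for proj in projections
    ])
```

Because the function only loops over whatever projection matrices it receives, the same code handles any number of views, which mirrors (in a simplified way) the view-agnostic framing of the abstract.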

Paper Structure

This paper contains 27 sections, 2 equations, 9 figures, 16 tables.

Figures (9)

  • Figure 1: HeatFormer is a novel neural optimizer for human mesh recovery from static multiview images. It recovers the SMPL shape and pose parameters by fully leveraging the views to resolve occlusions (e.g., note how well the body parts occluded by objects and other body parts are recovered). It is also agnostic to the configuration and number of views, which is essential for generalization in real-world scenes. (From left to right, the datasets are Human3.6M [6682899], BEHAVE [bhatnagar22behave], MPI-INF-3DHP [mono-3dhp2017], and RICH [Huang:CVPR:2022].)
  • Figure 2: HeatFormer realizes neural optimization for HMR in which the Transformer encoder-decoder model serves as an unrolled iteration of SMPL fitting to the observed images. It first extracts image features and a heatmap for each view, which are aggregated with a novel encoder and input to the decoder. The decoder also takes in heatmaps generated from the current SMPL estimate and, through its unrolled inference, iteratively aligns them with the observed heatmaps.
  • Figure 3: HeatFormer is an unrolled iterative optimizer realized through its forward inference. HeatFormer converges to accurate SMPL estimates within three unrolled inferences.
  • Figure 4: HeatFormer can reconstruct occluded body parts by referencing other views, even when those views capture the body part from a different viewing direction. This complementary use of multiview images is the core strength of HeatFormer, making it a complement to monocular methods like HMR2.0.
  • Figure A: Qualitative results on the Human3.6M [6682899] dataset. HeatFormer successfully leverages the multiview observations to resolve the complex occlusions.
  • ...and 4 more figures
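
The captions of Figures 2 and 3 describe HeatFormer as an unrolled iterative optimizer that converges within three forward inferences. The control flow of such an unrolled refinement can be sketched as below. This is only an illustration of the loop structure: `unrolled_refine` and `toy_update` are hypothetical names, and `toy_update` is a hand-written stand-in that moves the estimate halfway toward a target vector, whereas in the paper the update is produced by a learned Transformer decoder operating on heatmaps and image features.

```python
import numpy as np


def unrolled_refine(params, observation, update_fn, num_steps=3):
    """Apply a fixed number of residual updates and return every intermediate estimate."""
    trajectory = [np.asarray(params, dtype=float)]
    for _ in range(num_steps):
        current = trajectory[-1]
        # Each unrolled step predicts a correction and adds it to the current estimate.
        trajectory.append(current + update_fn(current, observation))
    return trajectory


def toy_update(current, observation, step=0.5):
    """Hypothetical stand-in for a learned update: move a fraction of the way to the target."""
    return step * (observation - current)
```

With `toy_update`, each step halves the remaining error, so three steps leave one eighth of the initial error. A learned update has no such guarantee; its convergence behavior is an empirical property of training.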