MEGA: Masked Generative Autoencoder for Human Mesh Recovery

Guénolé Fiche, Simon Leglaive, Xavier Alameda-Pineda, Francesc Moreno-Noguer

TL;DR

This work tackles the ill-posed problem of recovering a 3D human mesh from a single image by introducing MEGA, a masked generative autoencoder that represents each mesh as a sequence of discrete tokens and learns to generate those tokens conditioned on image features. MEGA combines a self-supervised pretraining phase on motion-capture data with a supervised training phase for HMR from RGB images, and supports both deterministic single-output prediction and stochastic multi-output generation. The token-based formulation, built on an encoder-decoder Transformer and a Mesh-VQ-VAE, yields state-of-the-art performance in both deterministic and stochastic settings on in-the-wild benchmarks, and also allows random mesh generation without image conditioning. Sampling several meshes for the same image exposes the range of plausible 3D interpretations, which is useful for applications that need multiple hypotheses or uncertainty quantification, with implications for animation, design, and healthcare.

Abstract

Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as an infinite set of 3D interpretations can explain the 2D observation equally well. Nevertheless, most HMR methods overlook this issue and make a single prediction without accounting for this ambiguity. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches.

Paper Structure (21 sections, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Human mesh recovery from a single image is an ill-posed problem due to depth ambiguity. Probabilistic approaches have aimed to address this by generating multiple predictions, but diversity often comes at the cost of accuracy. MEGA, our HMR model based on masked generative modeling, achieves state-of-the-art performance on in-the-wild benchmarks in single- and multi-output settings. Given a single image, MEGA can make predictions that all look accurate given the 2D cues but correspond to diverse 3D interpretations.
  • Figure 2: MEGA is a masked generative model based on an encoder-decoder Transformer architecture. During the self-supervised pretraining stage, MEGA is trained to predict human mesh tokens from partially visible inputs using motion capture data without paired image data. During the supervised training stage for HMR, the model is trained to predict randomly masked human mesh tokens conditioned on image embeddings. For both training stages, only the cross-entropy loss is used on the predicted mesh tokens. At test time, in stochastic inference mode, we start from a fully masked sequence of tokens and iteratively sample human mesh tokens conditioned on input image embeddings. In deterministic inference mode, we predict all tokens in a single forward pass.
  • Figure 3: Prediction process iterations. We visualize the predictions for intermediate steps in stochastic mode. All masked tokens are replaced by the first token of the codebook, corresponding to index 0.
  • Figure 4: Qualitative samples. Given a single image with occlusions, MEGA makes diverse plausible predictions.
  • Figure 5: Error distribution. We visualize the distribution of the MPJPE in mm on the 3DPW dataset.
  • ...and 5 more figures
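
The two inference modes described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the model is a toy stand-in that returns pseudo-random logits, and the names `MASK`, `make_toy_model`, `decode_deterministic`, and `decode_stochastic` are hypothetical. The caption only states that stochastic decoding starts from a fully masked sequence and samples tokens iteratively, so the unmasking schedule used here (an even split over the steps, most confident samples kept first, in the style of MaskGIT-like decoders) is an assumption.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for a masked position


def make_toy_model(codebook_size, image_seed):
    """Stand-in for the image-conditioned Transformer (purely illustrative).

    Returns a function mapping a token sequence to one logit vector per
    position. `image_seed` plays the role of the image embeddings.
    """
    def model(tokens):
        # Logits depend on the "image" and on the currently visible tokens.
        seed = image_seed + sum((i + 1) * (t + 2) for i, t in enumerate(tokens))
        rng = random.Random(seed)
        return [[rng.gauss(0.0, 1.0) for _ in range(codebook_size)] for _ in tokens]
    return model


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_deterministic(model, seq_len):
    """Deterministic mode: one forward pass on a fully masked sequence."""
    logits = model([MASK] * seq_len)
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def decode_stochastic(model, seq_len, num_steps, rng):
    """Stochastic mode: iteratively sample tokens, starting fully masked."""
    tokens = [MASK] * seq_len
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        logits = model(tokens)
        # Sample one candidate token per masked position.
        candidates = {}
        for i in masked:
            probs = softmax(logits[i])
            token = rng.choices(range(len(probs)), weights=probs)[0]
            candidates[i] = (token, probs[token])
        # Assumed schedule: split the remaining positions evenly over the
        # remaining steps, keeping the most confident candidates first.
        keep = math.ceil(len(masked) / (num_steps - step))
        ranked = sorted(masked, key=lambda i: candidates[i][1], reverse=True)
        for i in ranked[:keep]:
            tokens[i] = candidates[i][0]
    return tokens


model = make_toy_model(codebook_size=16, image_seed=0)
single = decode_deterministic(model, seq_len=8)
samples = [decode_stochastic(model, 8, num_steps=4, rng=random.Random(s))
           for s in range(3)]
```

In the setting described above, the resulting token ids would then be mapped back to a mesh by the Mesh-VQ-VAE decoder, a step this sketch omits. Calling `decode_stochastic` with different random generators yields different token sequences for the same conditioning, which mirrors how multiple mesh hypotheses are drawn for one image.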