MEGA: Masked Generative Autoencoder for Human Mesh Recovery
Guénolé Fiche, Simon Leglaive, Xavier Alameda-Pineda, Francesc Moreno-Noguer
TL;DR
This work tackles the ill-posed problem of recovering 3D human meshes from a single image by introducing MEGA, a masked generative autoencoder that represents each mesh as a sequence of discrete tokens and learns to generate that sequence conditioned on image features. MEGA combines a self-supervised pretraining phase on motion-capture data with a supervised phase for HMR from RGB images, and supports both a deterministic mode that returns a single prediction and a stochastic mode that samples multiple predictions. The token-based formulation, built on an encoder-decoder Transformer and a Mesh-VQ-VAE, yields state-of-the-art performance in both deterministic and probabilistic settings on in-the-wild benchmarks, and also allows random mesh generation without image conditioning. The sampled meshes expose the diversity of plausible solutions for a given image, which benefits applications that need multiple hypotheses or uncertainty quantification, with implications for animation, design, and healthcare.
Abstract
Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as an infinite set of 3D interpretations can explain the 2D observation equally well. Nevertheless, most HMR methods overlook this issue and make a single prediction without accounting for this ambiguity. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches.
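To make the two generation modes concrete, the following is a minimal sketch of generic masked token decoding, not the authors' implementation. It assumes a confidence-based iterative unmasking loop with a cosine schedule, as commonly used in masked generative models; the abstract does not specify MEGA's actual schedule. The names `toy_logits`, `generate`, `MASK`, the sequence length, and the codebook size are all hypothetical, and `toy_logits` is a random stand-in for the image-conditioned Transformer.

```python
import math
import random

MASK = -1  # hypothetical id marking a position that is still masked


def toy_logits(tokens, image_feat, codebook_size):
    """Stand-in for the Transformer: returns one list of logits per position.

    Purely illustrative. A real model would attend to image features and
    to the tokens revealed so far; here we only derive a seed from them.
    """
    seed = image_feat * 1000003 + sum((i + 1) * (t + 2) for i, t in enumerate(tokens))
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(codebook_size)] for _ in tokens]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(image_feat, seq_len=8, codebook_size=16, steps=4,
             stochastic=False, rng=None):
    """Fill a fully masked token sequence over a few refinement steps.

    stochastic=False picks the most likely token (single prediction);
    stochastic=True samples from the predicted distribution (multiple
    plausible predictions across calls with different rng seeds).
    """
    rng = rng or random.Random(0)
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        logits = toy_logits(tokens, image_feat, codebook_size)
        proposals = {}
        for i in masked:
            probs = softmax(logits[i])
            if stochastic:
                tok = rng.choices(range(codebook_size), weights=probs)[0]
            else:
                tok = max(range(codebook_size), key=probs.__getitem__)
            proposals[i] = (tok, probs[tok])
        # Cosine schedule (an assumption): fraction of the sequence that
        # should remain masked after this step; reaches 0 at the last step.
        frac_masked = math.cos(math.pi / 2 * step / steps)
        n_keep_masked = int(math.floor(frac_masked * seq_len))
        n_reveal = max(1, len(masked) - n_keep_masked)
        # Reveal the most confident proposals; the rest stay masked.
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:n_reveal]:
            tokens[i] = proposals[i][0]
    return tokens


single = generate(image_feat=7)  # deterministic mode
samples = [generate(image_feat=7, stochastic=True, rng=random.Random(s))
           for s in range(3)]    # stochastic mode, three hypotheses
```

In a full pipeline, each generated token sequence would then be passed to the decoder of the mesh tokenizer (the Mesh-VQ-VAE named in the TL;DR) to obtain a 3D mesh; that step is omitted here.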
