VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space

Guénolé Fiche, Simon Leglaive, Xavier Alameda-Pineda, Antonio Agudo, Francesc Moreno-Noguer

TL;DR

VQ-HPS reframes human pose and shape estimation (HPSE) from RGB images as a classification task in a vector-quantized mesh latent space, using Mesh-VQ-VAE to encode a canonical SMPL mesh into discrete indices. A Transformer-based encoder–decoder predicts these indices from image features; the predicted indices are then decoded into a full canonical mesh, which is oriented using the predicted rotation and camera parameters. The Mesh-VQ-VAE is pretrained on AMASS and frozen during VQ-HPS training, which acts as a regularizer and enables strong performance with limited data as well as competitive results on large-scale datasets. The approach achieves state-of-the-art results in scarce-data regimes on 3DPW and EMDB, demonstrating the data efficiency and robustness of a discrete latent mesh representation for HPSE.

Abstract

Previous works on Human Pose and Shape Estimation (HPSE) from RGB images can be broadly categorized into two main groups: parametric and non-parametric approaches. Parametric techniques leverage a low-dimensional statistical body model for realistic results, whereas recent non-parametric methods achieve higher precision by directly regressing the 3D coordinates of the human body mesh. This work introduces a novel paradigm to address the HPSE problem, involving a low-dimensional discrete latent representation of the human mesh and framing HPSE as a classification task. Instead of predicting body model parameters or 3D vertex coordinates, we focus on predicting the proposed discrete latent representation, which can be decoded into a registered human mesh. This innovative paradigm offers two key advantages. Firstly, predicting a low-dimensional discrete representation confines our predictions to the space of anthropomorphic poses and shapes even when little training data is available. Secondly, by framing the problem as a classification task, we can harness the discriminative power inherent in neural networks. The proposed model, VQ-HPS, predicts the discrete latent representation of the mesh. The experimental results demonstrate that VQ-HPS outperforms the current state-of-the-art non-parametric approaches while yielding results as realistic as those produced by parametric methods when trained with little data. VQ-HPS also shows promising results when training on large-scale datasets, highlighting the significant potential of the classification approach for HPSE. See the project page at https://g-fiche.github.io/research-pages/vqhps/
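To make the classification framing concrete, the following is a minimal sketch of the two ideas the abstract combines: mapping continuous latents to discrete codebook indices, and training a predictor on those indices with a cross-entropy loss. It is not the authors' implementation. The function names (`quantize`, `predict_indices`, `cross_entropy`), the nearest-neighbour lookup, and the toy codebook and logits are all assumptions chosen for illustration; the actual Mesh-VQ-VAE architecture and codebook size are not described in this summary.

```python
import math

def quantize(latents, codebook):
    """Map each continuous latent vector to the index of its nearest codebook entry."""
    indices = []
    for z in latents:
        dists = [math.dist(z, entry) for entry in codebook]
        indices.append(dists.index(min(dists)))
    return indices

def predict_indices(logits):
    """Classification view: for each latent position, pick the most likely index."""
    return [row.index(max(row)) for row in logits]

def cross_entropy(logits_row, target_idx):
    """Numerically stable cross-entropy for one latent position."""
    m = max(logits_row)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits_row))
    return log_sum_exp - logits_row[target_idx]

# Toy values (hypothetical): 3 codebook entries of dimension 2, 2 latent positions.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [[0.9, 0.1], [0.1, 0.8]]

# Ground-truth indices, as a frozen encoder plus quantizer would provide them.
target = quantize(latents, codebook)

# Scores a hypothetical image-conditioned predictor might output, one row per position.
logits = [[0.1, 2.0, -1.0], [0.0, 0.3, 1.5]]
pred = predict_indices(logits)

# Training signal: average cross-entropy between predicted scores and target indices.
loss = sum(cross_entropy(row, t) for row, t in zip(logits, target)) / len(target)
```

Because every predicted index points to a codebook entry that a pretrained decoder knows how to turn into a mesh, the output stays within the set of decodable shapes regardless of how little data the predictor was trained on, which is the regularization argument made above.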

Paper Structure (23 sections, 1 equation, 14 figures, 5 tables)

Figures (14)

  • Figure 1: VQ-HPS formulates the human pose and shape estimation problem as a classification task in a vector-quantized latent space. We present the results of VQ-HPS on two challenging scenarios with in-the-wild conditions and poor illumination, comparing its performance to that of HMR (Kanazawa et al.), CLIFF (Li et al., 2022), and FastMETRO-S (Cho et al., 2022) when trained on little data.
  • Figure 1: Additional comparisons. We compare our method with HMR, CLIFF, and FastMETRO-S on 3DPW with models trained on 3DPW (first row), and on EMDB with models trained on EMDB.
  • Figure 2: VQ-HPS global process for predicting the mesh given an image. We first predict the camera $\hat{\pi}$ and the rotation $\hat{R}$ from the image $I$. Then, we use the image, the predicted rotation, and the camera to predict the vertices $\hat{V}_c$ of the canonical mesh. Finally, $\hat{V}_c$ is rotated according to $\hat{R}$ to obtain the final mesh vertices $\hat{V}$.
  • Figure 2: Qualitative results. We visualize results obtained with VQ-HPS on the 3DPW dataset.
  • Figure 3: Mesh-VQ-VAE reconstruction error. Samples of reconstruction on the 3DPW test set. The error is in cm and corresponds to the Euclidean distance between each vertex of the original mesh and the corresponding vertex of the reconstruction.
  • ...and 9 more figures
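
The last step in the Figure 2 caption, rotating the canonical vertices $\hat{V}_c$ by $\hat{R}$ to obtain $\hat{V}$, can be sketched as below. This is an illustration under stated assumptions, not the paper's code: it assumes $\hat{R}$ is given as a 3x3 rotation matrix applied per vertex as $\hat{v} = \hat{R}\,\hat{v}_c$, and the helper names (`rotate_vertices`, `rot_z`) and the toy vertices are hypothetical. The rotation parameterization actually used by VQ-HPS is not stated in this summary.

```python
import math

def rotate_vertices(vertices, R):
    """Apply a 3x3 rotation matrix R to each 3D vertex (v' = R v)."""
    return [
        [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]
        for v in vertices
    ]

def rot_z(theta):
    """Rotation by angle theta (radians) about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s,  c, 0.0],
            [0.0, 0.0, 1.0]]

# Two toy "canonical" vertices standing in for the decoded mesh (hypothetical values).
canonical = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5]]

# Stand-in for the predicted global rotation: a quarter turn about z.
R_hat = rot_z(math.pi / 2)

# Final mesh vertices in the rotated frame.
final = rotate_vertices(canonical, R_hat)
```

Separating the global rotation from the canonical mesh means the discrete latent code only has to describe pose and shape in a fixed orientation, while the orientation itself is handled by a single rigid transform.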