LieHMR: Autoregressive Human Mesh Recovery with $SO(3)$ Diffusion

Donghwan Kim, Tae-Kyun Kim

TL;DR

LieHMR tackles monocular human mesh recovery by learning an image-conditioned distribution over $SO(3)$ pose parameters using a diffusion model. The architecture disentangles a time-independent transformer that captures joint relationships from a per-joint, time-dependent denoiser that operates on $SO(3)$, enabling both image-conditioned and unconditional generation. Trained with a hybrid supervised/self-supervised strategy, LieHMR achieves strong single-output performance and diverse multi-output samples, surpassing several probabilistic baselines and competing with state-of-the-art deterministic methods. The approach demonstrates robust generation under occlusion and depth ambiguity, with practical implications for realistic human motion modeling in vision and graphics, while acknowledging the diffusion-based inference cost and potential for further acceleration and multimodal extensions.

Abstract

We tackle the problem of Human Mesh Recovery (HMR) from a single RGB image, formulating it as image-conditioned human pose and shape generation. While recovering 3D human pose from 2D observations is inherently ambiguous, most existing approaches regress a single deterministic output. Probabilistic methods attempt to address this by generating multiple plausible outputs to model the ambiguity. However, these methods often exhibit a trade-off between accuracy and sample diversity, and their single predictions are not competitive with state-of-the-art deterministic models. To overcome these limitations, we propose a novel approach that models a distribution well aligned with the 2D observations. In particular, we introduce an $SO(3)$ diffusion model, which generates the distribution of pose parameters, represented as 3D rotations, both unconditionally and conditioned on image observations via conditioning dropout. Our model learns the hierarchical structure of human body joints using a transformer. Instead of using a transformer as the denoising model, the time-independent transformer extracts latent vectors for the joints, and a small MLP-based denoising model learns the per-joint distribution conditioned on the latent vector. We experimentally demonstrate and analyze that our model effectively predicts an accurate pose probability distribution.
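
The conditioning dropout mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common practice of swapping a sample's image features for a "null" embedding with some probability during training, so that one network sees both the conditional and the unconditional case. The function name `drop_condition`, the `null_feat` vector, and the `p_drop` parameter are hypothetical names chosen for illustration; the abstract does not specify these details.

```python
import numpy as np

def drop_condition(image_feats, null_feat, p_drop, rng):
    """Randomly replace per-sample image features with a null embedding.

    image_feats: (batch, dim) array of image features (assumed layout).
    null_feat:   (dim,) placeholder used when the condition is dropped
                 (hypothetical; in practice this could be a learned vector).
    p_drop:      probability of dropping the condition for each sample.
    """
    batch = image_feats.shape[0]
    # One Bernoulli draw per sample: True means "train this sample unconditionally".
    drop = rng.random(batch) < p_drop
    # Broadcast the mask over the feature dimension and swap in the null embedding.
    cond = np.where(drop[:, None], null_feat[None, :], image_feats)
    return cond, drop

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))
cond, dropped = drop_condition(feats, np.zeros(16), p_drop=0.1, rng=rng)
```

At sampling time, passing the null embedding for every sample would correspond to unconditional generation, while passing the real image features would correspond to image-conditioned generation.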

Paper Structure

This paper contains 34 sections, 6 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Modeling a probability distribution well aligned with 2D observations in HMR. We propose LieHMR, which simultaneously learns image-conditioned and unconditional human pose and shape generation. Our goal is to model a distribution that is well aligned with the 2D observations. (a) Given an image with less ambiguity, we can sample a single output as accurate as those of prior deterministic HMR methods. (b) Given an ambiguous image, we can sample multiple plausible outputs. The diversity is mainly related to depth ambiguity or occlusion. (c) LieHMR can also generate human pose and shape in an unconditional manner, which indicates that the model learns the pose priors well.
  • Figure 2: (Left) We modify LieHMR based on DiT [Peebles2022DiT]. Here, the denoising model consists of the transformer and learns the joint probability distribution of all tokens. We concatenate the image features to the input sequence. (Right) A brief overview of vector-quantization-based methods [dwivedi_cvpr2024_tokenhmr, fiche2024vq, fiche2024mega]. They predict the quantized tokens conditioned on the image features from a fully masked sequence and reconstruct the pose parameters or 3D mesh with a VQ-VAE decoder.
  • Figure 3: Overview of LieHMR. (Sequence model) Given the partially visible pose tokens, the transformer-based sequence model extracts latent vectors $z$ for all tokens. Following the pipeline of the Masked Autoencoder (MAE), we apply the encoder to the visible tokens and fill in the mask tokens before the decoder. ($SO(3)$ diffusion) Given the noised pose token $\theta_t^i$, the MLP-based denoising model predicts the noise conditioned on the latent vector $z^i$. We perform the denoising process independently for each token. The forward and reverse diffusion processes are constrained to the $SO(3)$ manifold. We optionally concatenate the image features for image-conditioned generation.
  • Figure 4: Ablation study on image-conditioned generation. We plot the speed/MPJPE trade-off on a 10% subset of the 3DPW and EMDB datasets in the single-output setting. The curves are obtained with different numbers of diffusion timesteps (75, 150, 250, 500, and 1,000).
  • Figure 5: Ablation study on unconditional generation. We plot the speed/FID and speed/APD trade-offs. The curves are obtained with different numbers of diffusion timesteps (75, 150, 250, 500, and 1,000) and autoregressive steps (1, 3, and 6).
  • ...and 8 more figures
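
The Figure 3 caption states that the forward and reverse diffusion processes are constrained to the $SO(3)$ manifold. The sketch below illustrates one generic way to keep a noised rotation on the manifold: sample a Gaussian vector in the tangent space, map it to a rotation with the exponential map (Rodrigues' formula), and compose it with the clean rotation. This is an illustrative assumption, not the paper's actual noise schedule or distribution; the names `hat`, `so3_exp`, `perturb_rotation`, and the scale `sigma` are hypothetical.

```python
import numpy as np

def hat(v):
    """Map a 3-vector to its skew-symmetric matrix (an element of so(3))."""
    x, y, z = v
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def so3_exp(v):
    """Exponential map so(3) -> SO(3) via Rodrigues' formula."""
    angle = np.linalg.norm(v)
    K = hat(v)
    if angle < 1e-4:
        # Taylor expansions avoid division by a near-zero angle.
        a = 1.0 - angle**2 / 6.0
        b = 0.5 - angle**2 / 24.0
    else:
        a = np.sin(angle) / angle
        b = (1.0 - np.cos(angle)) / angle**2
    return np.eye(3) + a * K + b * (K @ K)

def perturb_rotation(R, sigma, rng):
    """Noise a rotation by composing it with exp of a scaled tangent-space Gaussian.

    Assumed noising rule for illustration only; the result is a product of
    two rotations and therefore stays on SO(3).
    """
    eps = rng.standard_normal(3)
    return R @ so3_exp(sigma * eps)

rng = np.random.default_rng(0)
R_clean = so3_exp(np.array([0.0, 0.0, np.pi / 2]))  # 90-degree rotation about z
R_noisy = perturb_rotation(R_clean, sigma=0.3, rng=rng)
```

Because the perturbation is itself a rotation, `R_noisy` remains orthogonal with determinant 1, which is not guaranteed when Gaussian noise is added directly to the matrix entries.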