FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding

Soroush Mehraban, Andrea Iaboni, Babak Taati

TL;DR

This work addresses the high computational cost of transformer-based HMR by introducing Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe), which reduce network depth and token count, respectively. To offset the accuracy loss that merging can cause, it adds a diffusion-based decoder that uses a learned VAE pose prior and temporal context, yielding anatomically plausible and temporally smooth mesh reconstructions. Empirical results show speed-ups of up to 2.3x on standard GPUs with comparable or slightly improved accuracy, and ablations underscore the importance of the diffusion prior, the training objectives, and background token handling. The approach offers a broadly deployable solution for real-time, temporally coherent 3D human mesh recovery without requiring specialized hardware or multi-sample inference.

Abstract

Recent transformer-based models for 3D Human Mesh Recovery (HMR) have achieved strong performance but often suffer from high computational cost and complexity due to deep transformer architectures and redundant tokens. In this paper, we introduce two HMR-specific merging strategies: Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe). ECLM selectively merges transformer layers that have minimal impact on the Mean Per Joint Position Error (MPJPE), while Mask-ToMe focuses on merging background tokens that contribute little to the final prediction. To further address the potential performance drop caused by merging, we propose a diffusion-based decoder that incorporates temporal context and leverages pose priors learned from large-scale motion capture datasets. Experiments across multiple benchmarks demonstrate that our method achieves up to 2.3x speed-up while slightly improving performance over the baseline.
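
As a concrete reference for the metric that ECLM constrains, the sketch below computes MPJPE as the mean Euclidean distance between predicted and ground-truth 3D joints. The acceptance check `within_error_budget` and its `tolerance` parameter are hypothetical names introduced only for illustration; the abstract does not specify how the error constraint is applied, so this is an assumption and not the authors' procedure.

```python
import numpy as np


def mpjpe(pred_joints, gt_joints):
    """Mean Per Joint Position Error.

    Both inputs have shape (..., num_joints, 3). The result is the mean
    Euclidean distance over all joints (and any leading frame/batch axes),
    in the same unit as the inputs (commonly millimetres).
    """
    pred = np.asarray(pred_joints, dtype=float)
    gt = np.asarray(gt_joints, dtype=float)
    per_joint_error = np.linalg.norm(pred - gt, axis=-1)
    return float(per_joint_error.mean())


def within_error_budget(baseline_err, merged_err, tolerance):
    """Hypothetical check: does a candidate merge keep the MPJPE increase
    within an allowed tolerance? (Illustrative assumption only.)"""
    return (merged_err - baseline_err) <= tolerance


# Toy example: every joint is displaced by (3, 4, 0), so each error is 5.
gt = np.zeros((2, 3, 3))          # 2 frames, 3 joints, xyz
pred = gt.copy()
pred[..., 0] += 3.0
pred[..., 1] += 4.0
toy_error = mpjpe(pred, gt)
```

In this toy case `toy_error` equals 5.0, since every joint is offset by a 3-4-5 displacement.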

Paper Structure

This paper contains 16 sections, 6 equations, 9 figures, 15 tables, 1 algorithm.

Figures (9)

  • Figure 1: Throughput vs. MPJPE on the EMDB benchmark. Throughput is evaluated on a single RTX 3090 GPU.
  • Figure 2: CKA (Centered Kernel Alignment) between pairs of Transformer layers in CameraHMR [patel2024camerahmr] and HMR2.0 [goel2023humans].
  • Figure 3: Overview of the Mask-ToMe strategy. Tokens are split into sets $A$ and $B$, and the most similar background token pairs are merged using similarity scores while masking out person tokens. The bold and underlined numbers represent the highest and second-highest similarity scores, respectively. The numbers shown are illustrative examples only.
  • Figure 4: Diffusion Decoder Overview. (a) In the first stage of training, a VAE model $\mathcal{V}$ is trained to learn human motion priors. (b) In the second stage, a denoiser $\epsilon_\theta$ is trained to recover pose latents conditioned on per-frame encodings extracted from the encoder $\mathcal{M}$.
  • Figure 5: Qualitative comparison of mesh reconstructions across FastHMR pipeline stages.
  • ...and 4 more figures
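
The Figure 3 caption describes splitting tokens into sets $A$ and $B$ and merging the most similar background pairs while masking out person tokens. The NumPy sketch below illustrates that idea under several assumptions that are not stated in the text above: tokens are split by alternating index, similarity is cosine similarity, merged tokens are averaged, and `r` (the number of merges per call) is a hypothetical parameter. It is a minimal illustration, not the authors' implementation.

```python
import numpy as np


def mask_guided_merge(tokens, is_person, r):
    """Merge up to r background tokens from set A into set B.

    tokens:    (N, D) array of token features.
    is_person: (N,) boolean mask; True marks tokens that must not be merged.
    r:         maximum number of A-to-B merges (hypothetical parameter).
    Returns the reduced token array (kept A tokens followed by B tokens).
    """
    tokens = np.asarray(tokens, dtype=float)
    is_person = np.asarray(is_person, dtype=bool)
    if len(tokens) < 2:
        return tokens

    # Assumption: alternate tokens between the two sets.
    a_idx = np.arange(0, len(tokens), 2)
    b_idx = np.arange(1, len(tokens), 2)
    a, b = tokens[a_idx], tokens[b_idx]

    def unit(x):
        norm = np.linalg.norm(x, axis=-1, keepdims=True)
        return x / np.maximum(norm, 1e-12)

    # Cosine similarity between every A token and every B token.
    sim = unit(a) @ unit(b).T
    # Mask out any pair that involves a person token.
    sim[is_person[a_idx], :] = -np.inf
    sim[:, is_person[b_idx]] = -np.inf

    best_b = sim.argmax(axis=1)      # best partner in B for each A token
    best_sim = sim.max(axis=1)
    order = np.argsort(-best_sim)    # most similar pairs first
    merged_a = np.array(
        [i for i in order[:r] if np.isfinite(best_sim[i])], dtype=int
    )

    # Average each merged A token into its chosen B token.
    sums = b.copy()
    counts = np.ones(len(b))
    for i in merged_a:
        sums[best_b[i]] += a[i]
        counts[best_b[i]] += 1
    b_out = sums / counts[:, None]

    keep = np.ones(len(a), dtype=bool)
    keep[merged_a] = False
    return np.concatenate([a[keep], b_out], axis=0)


# Toy example: tokens 2 and 3 are person tokens and must survive unchanged.
toy_tokens = np.array(
    [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.5]]
)
toy_mask = np.array([False, False, True, True, False, False])
toy_out = mask_guided_merge(toy_tokens, toy_mask, r=1)
```

With `r=1`, the two nearly identical background tokens (indices 0 and 1) are averaged into one, so six tokens become five while both person tokens pass through untouched. Note that this sketch does not preserve the original token order; a practical version would also need to track positions.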