Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation

Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Jialun Cai, Nicu Sebe

TL;DR

The paper tackles the heavy computational cost of video pose transformers (VPTs) for 3D human pose estimation by introducing the Hourglass Tokenizer (HoT), which prunes the pose tokens of redundant frames and later recovers full-length tokens to maintain complete temporal coverage. HoT comprises two core components: a token pruning cluster (TPC), which dynamically selects a small, semantically diverse set of representative tokens, and a token recovering attention (TRA), which reconstructs the full temporal resolution from these tokens. Together they enable fast seq2seq and seq2frame inference when plugged into existing VPTs such as MHFormer, MixSTE, and MotionBERT. Across Human3.6M and MPI-INF-3DHP, HoT reduces FLOPs by approximately 40–50% with only minor or negligible drops in accuracy; on Human3.6M, for instance, it saves nearly 50% of FLOPs for MotionBERT without sacrificing accuracy and nearly 40% for MixSTE with only a 0.2% accuracy drop. The framework is designed as a general, plug-and-play solution for deploying transformer-based 3D human pose estimation on devices with limited compute resources, with results that remain highly competitive with, and in some settings better than, the original models.

Abstract

Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose estimation from videos. Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, so that only a few pose tokens remain in the intermediate transformer blocks, which improves model efficiency. To achieve this effectively, we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames. In addition, we develop a token recovering attention (TRA) to restore the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that our method can achieve both high efficiency and estimation accuracy compared to the original VPT models. For instance, when applied to MotionBERT and MixSTE on Human3.6M, our HoT saves nearly 50% of FLOPs without sacrificing accuracy and nearly 40% of FLOPs with only a 0.2% accuracy drop, respectively. Code and models are available at https://github.com/NationalGAILab/HoT.
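
To give a feel for why keeping only a few tokens in the intermediate blocks saves computation, the following back-of-the-envelope sketch compares a full-length ("rectangle") stack with a pruned ("hourglass") stack. It is an illustration only, not the paper's FLOPs accounting: the cost formula is a common rough approximation for a temporal transformer block, it ignores the cost of TPC and TRA themselves, and every name and number in it (`block_flops`, `model_flops`, 243 frames, 81 kept tokens, width 512, 8 blocks, pruning after block 3) is a hypothetical choice.

```python
def block_flops(n_tokens, dim, mlp_ratio=4):
    """Rough multiply-accumulate count for one transformer block (assumed formula)."""
    proj = 4 * n_tokens * dim * dim               # Q, K, V and output projections
    attn = 2 * n_tokens * n_tokens * dim          # QK^T plus attention-weighted sum
    mlp = 2 * n_tokens * dim * (mlp_ratio * dim)  # two-layer feed-forward network
    return proj + attn + mlp


def model_flops(n_frames, dim, n_blocks, prune_after=None, n_keep=None):
    """Sum block costs; blocks with index >= prune_after see only n_keep tokens."""
    total = 0
    for b in range(n_blocks):
        pruned = prune_after is not None and b >= prune_after
        total += block_flops(n_keep if pruned else n_frames, dim)
    return total


# Hypothetical configuration, chosen only for illustration.
rectangle = model_flops(n_frames=243, dim=512, n_blocks=8)
hourglass = model_flops(n_frames=243, dim=512, n_blocks=8, prune_after=3, n_keep=81)
saving = 1 - hourglass / rectangle
print(f"estimated saving: {saving:.1%}")
```

The estimate depends strongly on where pruning happens and how many tokens are kept, so it should not be read as reproducing the reported 40–50% figures.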

Paper Structure (18 sections, 4 equations, 14 figures, 13 tables)

Figures (14)

  • Figure 1: FLOPs and estimation errors (MPJPE, lower is better) of different VPTs on Human3.6M dataset. We achieve highly competitive or even better results while saving FLOPs.
  • Figure 2: (a) Existing VPTs follow a "rectangle" paradigm that retains the full-length sequence across all blocks, which incurs expensive and redundant computational costs. (b) Instead, our HoT follows an "hourglass" paradigm that prunes the pose tokens and recovers the full-length tokens, which keeps a few tokens in the intermediate transformer blocks and thus improves the model efficiency. The gray squares represent the pruned tokens.
  • Figure 3: Overview of the proposed Hourglass Tokenizer (HoT). It mainly consists of a token pruning cluster (TPC) module and a token recovering attention (TRA) module. TPC selects the pose tokens of representative frames after the first few transformer blocks and TRA recovers the full-length tokens after the last transformer block.
  • Figure 4: Illustration of our token pruning cluster (TPC) architecture. Given the input pose tokens, we pool them in the spatial dimension, cluster the input tokens into several groups according to the feature similarity of the resulting pooled tokens, and select the cluster centers as the representative tokens.
  • Figure 5: Illustration of our token recovering attention (TRA) architecture. TRA takes the representative tokens of the last transformer block, along with learnable tokens that are initialized to zero, as input to recover the full-length tokens.
  • ...and 9 more figures
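
The TPC and TRA steps described in the captions of Figures 4 and 5 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the captions do not specify the clustering rule, so a density-peaks-style score over k nearest neighbours is assumed here; the learnable tokens are assumed to act as attention queries with the representative tokens as keys and values; and all function names, shapes, and projection matrices are hypothetical.

```python
import numpy as np


def token_pruning_cluster(x, n_keep, k=2):
    """Pick n_keep representative frames from pose tokens x of shape (F, J, C)."""
    pooled = x.mean(axis=1)                          # pool over the joint (spatial) axis -> (F, C)
    diff = pooled[:, None, :] - pooled[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))         # pairwise frame distances (F, F)
    # Assumed rule: local density from the k nearest neighbours (column 0 is self).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    density = np.exp(-(knn ** 2).mean(axis=1))
    # Distance from each frame to its nearest frame of higher density.
    higher = density[None, :] > density[:, None]
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()              # densest frame has no higher neighbour
    score = density * delta                          # high score -> likely cluster center
    centers = np.argsort(-score)[:n_keep]
    return np.sort(centers)                          # keep temporal order


def token_recovering_attention(rep, queries, w_q, w_k, w_v):
    """Cross-attention from F query tokens (F, C) to rep tokens (n, J, C) -> (F, J, C)."""
    q = queries @ w_q
    key = rep @ w_k
    val = rep @ w_v
    logits = np.einsum("fc,njc->jfn", q, key) / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return np.einsum("jfn,njc->fjc", attn, val)


# Hypothetical sizes: 16 frames, 17 joints, 8 channels, 4 kept frames.
rng = np.random.default_rng(0)
F, J, C, n_keep = 16, 17, 8, 4
x = rng.normal(size=(F, J, C))
idx = token_pruning_cluster(x, n_keep)
rep = x[idx]                                         # representative tokens (n_keep, J, C)
queries = np.zeros((F, C))                           # learnable tokens, initialized to zero
w_q, w_k, w_v = (rng.normal(size=(C, C)) for _ in range(3))
recovered = token_recovering_attention(rep, queries, w_q, w_k, w_v)
print(idx, recovered.shape)
```

In this sketch the zero-initialized queries yield uniform attention, so every recovered frame starts as the average of the projected representative tokens; in a trained model the queries would be learned so that each output frame attends to the representative frames most relevant to it.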