Multi-Grained Feature Pruning for Video-Based Human Pose Estimation

Zhigang Wang, Shaojing Fan, Zhenguang Liu, Zheqi Wu, Sifan Wu, Yingying Jiao

TL;DR

This paper tackles the inefficiency and limited fine-grained perception of Transformer-based video pose estimation by introducing FTP-Pose, which combines a Multi-Grained Feature Encoder (MGFE) with a feature token pruning strategy based on density peaks clustering. The MGFE maintains a high-resolution branch for detailed spatial cues and a low-resolution branch for temporal dynamics. The pruning step selects semantically informative tokens using each token's local density $\rho_i$, distance $\delta_i$, and score $\mathrm{score}_i = \rho_i \delta_i$, with the amount of pruning controlled by a pruning ratio $\varepsilon$. The approach yields state-of-the-art results on the PoseTrack datasets, reaching $87.4$ mAP on PoseTrack2017 together with a $93.8\%$ improvement in inference speed over the baseline, which indicates that discarding redundant tokens can improve accuracy and efficiency at the same time. Overall, FTP-Pose provides a practical framework for scalable video-based pose estimation with Transformer architectures.
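
The scoring rule $\mathrm{score}_i = \rho_i \delta_i$ can be illustrated with a small sketch. The summary above does not define $\rho_i$ or $\delta_i$, so the code below assumes the common k-nearest-neighbour variant of density peaks clustering: $\rho_i$ is the exponential of the negative mean squared distance to the $k$ nearest tokens, and $\delta_i$ is the distance to the nearest token of higher density. It also assumes that $\varepsilon$ is the fraction of tokens discarded. The function names `dpc_scores` and `prune_tokens` and the toy tokens are hypothetical, and this is not the authors' implementation.

```python
import math

def dpc_scores(tokens, k=2):
    """Return (rho, delta, score) lists for a list of token vectors.

    Assumed definitions (standard DPC-kNN, not confirmed by the source).
    """
    n = len(tokens)
    # Pairwise Euclidean distances between all tokens.
    dist = [[math.dist(a, b) for b in tokens] for a in tokens]
    # Local density: exp(-mean squared distance to the k nearest other tokens).
    rho = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k] or [0.0]
        rho.append(math.exp(-sum(d * d for d in nearest) / len(nearest)))
    # Distance to the nearest token with strictly higher density;
    # a token with no denser peer gets its largest distance to any token.
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    score = [r * d for r, d in zip(rho, delta)]
    return rho, delta, score

def prune_tokens(tokens, eps, k=2):
    """Drop floor(eps * n) lowest-scoring tokens; return kept indices in order."""
    n = len(tokens)
    n_keep = max(1, n - int(eps * n))
    _, _, score = dpc_scores(tokens, k)
    ranked = sorted(range(n), key=lambda i: score[i], reverse=True)
    return sorted(ranked[:n_keep])

# Hypothetical toy tokens: two tight groups of three 2-D features each.
tokens = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
kept = prune_tokens(tokens, eps=0.5)
```

Under these assumptions, a token scores highly only if it lies in a dense region and is far from any denser token, so the centre of each group survives pruning while its near-duplicates are the first to be dropped.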

Abstract

Human pose estimation, with its broad applications in action recognition and motion capture, has experienced significant advancements. However, current Transformer-based methods for video pose estimation often face challenges in managing redundant temporal information and achieving fine-grained perception because they only focus on processing low-resolution features. To address these challenges, we propose a novel multi-scale resolution framework that encodes spatio-temporal representations at varying granularities and executes fine-grained perception compensation. Furthermore, we employ a density peaks clustering method to dynamically identify and prioritize tokens that offer important semantic information. This strategy effectively prunes redundant feature tokens, especially those arising from multi-frame features, thereby optimizing computational efficiency without sacrificing semantic richness. Empirically, it sets new benchmarks for both performance and efficiency on three large-scale datasets. Our method achieves a 93.8% improvement in inference speed compared to the baseline, while also enhancing pose estimation accuracy, reaching 87.4 mAP on the PoseTrack2017 dataset.

Paper Structure

This paper contains 11 sections, 4 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The overall pipeline of our FTP-Pose framework. Given an input sequence $\{I_{t-1}^{i}, I_{t}^{i}, I_{t+1}^{i}\}$, our goal is to generate the pose heatmap $\mathbf{H}_{t}^{i}$ of the key frame $I_{t}^{i}$. Initially, we extract the feature tokens $\{\boldsymbol{F}_{t-1}^{i}, \boldsymbol{F}_{t}^{i}, \boldsymbol{F}_{t+1}^{i}\}$ via a ViT backbone. We then feed these features into the Multi-Grained Feature Encoder (MGFE) to manage fine-grained spatial dependencies and capture high-dimensional temporal contexts. Subsequently, the output features derived from MGFE are combined through a cross-attention layer. Finally, these combined features are processed through a specific pose head to estimate the pose heatmaps $\mathbf{H}_{t}^{i}$.
  • Figure 2: Qualitative comparison of our FTP-Pose, DCPose [liu2021dcpose], and TDMI [feng2023tdmi] on the PoseTrack dataset, featuring challenges such as pose occlusion, fast motion, and video defocus. Red solid circles mark inaccurate pose predictions.
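
The fusion step in the Figure 1 caption, where MGFE outputs are combined through a cross-attention layer, can be sketched as plain scaled dot-product attention. The caption does not say which branch supplies the queries or how the layer is parameterised, so the sketch below assumes that high-resolution tokens act as queries over low-resolution keys and values, uses a single head, and omits learned projections. The names `cross_attention`, `high_res`, and `low_res` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention, no learned projections."""
    d = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query token to every key token, scaled by sqrt(d).
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(logits)
        # Each output token is the attention-weighted average of the values.
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Hypothetical toy inputs: 2 high-resolution tokens attend to 3 low-resolution tokens.
high_res = [[1.0, 0.0], [0.0, 1.0]]
low_res = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(high_res, low_res, low_res)
```

Each fused token is a convex combination of the value tokens, weighted towards the keys most similar to its query. A real layer would add learned query, key, and value projections and typically several heads.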