Multi-Grained Feature Pruning for Video-Based Human Pose Estimation
Zhigang Wang, Shaojing Fan, Zhenguang Liu, Zheqi Wu, Sifan Wu, Yingying Jiao
TL;DR
This paper tackles the inefficiency and limited fine-grained perception of Transformer-based video pose estimation by introducing FTP-Pose, which combines a Multi-Grained Feature Encoder (MGFE) with a feature token pruning strategy based on density peaks clustering. The MGFE maintains a high-resolution branch for detailed spatial cues and a low-resolution branch for temporal dynamics, while the pruning step selects semantically informative tokens using each token's local density $\rho_i$, distance $\delta_i$, and score $\mathrm{score}_i = \rho_i \delta_i$, controlled by a pruning ratio $\varepsilon$. The approach yields state-of-the-art results on the PoseTrack datasets, reaching $87.4$ mAP on PoseTrack2017 together with a $93.8\%$ improvement in inference speed over the baseline, which indicates that discarding redundant tokens can improve accuracy and efficiency at the same time. Overall, FTP-Pose provides a practical framework for scalable video-based pose estimation with Transformer architectures.
Abstract
Human pose estimation, with its broad applications in action recognition and motion capture, has advanced significantly. However, current Transformer-based methods for video pose estimation often struggle to manage redundant temporal information and to achieve fine-grained perception, because they process only low-resolution features. To address these challenges, we propose a novel multi-scale resolution framework that encodes spatio-temporal representations at varying granularities and performs fine-grained perception compensation. Furthermore, we employ a density peaks clustering method to dynamically identify and prioritize tokens that carry important semantic information. This strategy effectively prunes redundant feature tokens, especially those arising from multi-frame features, thereby improving computational efficiency without sacrificing semantic richness. Empirically, our method sets new benchmarks for both performance and efficiency on three large-scale datasets. It achieves a 93.8% improvement in inference speed compared to the baseline while also enhancing pose estimation accuracy, reaching 87.4 mAP on the PoseTrack2017 dataset.
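
To make the token selection concrete, the following is a minimal NumPy sketch of density peaks scoring and pruning. It is an illustration under stated assumptions, not the authors' implementation. The text above gives only the score $\rho_i \delta_i$ and the pruning ratio $\varepsilon$, so the specific forms are assumed here: $\rho_i$ is taken as a Gaussian of the mean squared distance to the `k` nearest tokens, $\delta_i$ follows the standard density peaks definition (minimum distance to any token of higher density, or the maximum distance for the densest token), and $\varepsilon$ is read as the fraction of tokens discarded. The function names `dpc_scores` and `prune_tokens` and the parameter `k` are hypothetical.

```python
import numpy as np


def dpc_scores(tokens, k=5):
    """Score tokens of shape (N, D); returns rho, delta, score, each of shape (N,)."""
    n = tokens.shape[0]
    if n < 2:
        raise ValueError("need at least two tokens")
    # Pairwise Euclidean distances, shape (N, N).
    diff = tokens[:, None, :] - tokens[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Local density rho_i (ASSUMED form): Gaussian of the mean squared
    # distance to the k nearest neighbours, skipping the zero self-distance.
    k = min(k, n - 1)
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    rho = np.exp(-(knn ** 2).mean(axis=1))
    # delta_i: minimum distance to any token with strictly higher density.
    higher = rho[None, :] > rho[:, None]  # higher[i, j] is True when rho_j > rho_i
    delta = np.where(higher, dist, np.inf).min(axis=1)
    # The densest token has no higher-density token: use its maximum distance.
    delta = np.where(np.isinf(delta), dist.max(axis=1), delta)
    return rho, delta, rho * delta


def prune_tokens(tokens, prune_ratio, k=5):
    """Keep the highest-scoring tokens; prune_ratio is ASSUMED to be the fraction dropped."""
    _, _, score = dpc_scores(tokens, k)
    n_keep = max(1, int(round(tokens.shape[0] * (1.0 - prune_ratio))))
    # Take the top-scoring indices, then restore the original token order.
    keep = np.sort(np.argsort(-score)[:n_keep])
    return tokens[keep], keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(16, 8))  # 16 hypothetical feature tokens of dimension 8
    kept, idx = prune_tokens(feats, prune_ratio=0.5)
    print(kept.shape, idx)
```

Multiplying $\rho_i$ by $\delta_i$ favors tokens that sit in dense regions yet lie far from any denser token, so the retained tokens tend to act as cluster representatives while near-duplicates within a cluster score low and are pruned.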
