GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation

Haonan Wang; Jie Liu; Jie Tang; Gangshan Wu; Bo Xu; Yanbing Chou; Yong Wang

TL;DR

GTPT addresses efficient whole-body 2D human pose estimation (HPE) by introducing keypoints in a coarse-to-fine manner and applying group-based pruning within a Transformer framework. It consists of a Tokenizer, a Coarse Encoder, a Coarse-to-Fine Module, and a Fine Encoder, and uses Multi-Head Group Attention to enable inter-group interaction while keeping computation low. The method pairs group-specific pruning with a Global Perceived Loss and curriculum learning to maintain performance under pruning and with many keypoints. On COCO and COCO-WholeBody, GTPT achieves better efficiency-accuracy trade-offs than several state-of-the-art methods at comparable FLOPs, which makes it practical for industrial deployment.

Abstract

In recent years, 2D human pose estimation has made significant progress on public benchmarks. However, many of these approaches are difficult to apply in industrial settings because of their large parameter counts and computational overhead. Efficient human pose estimation remains a hurdle, especially for whole-body pose estimation with numerous keypoints. While most current methods for efficient human pose estimation rely primarily on CNNs, we propose the Group-based Token Pruning Transformer (GTPT), which fully harnesses the advantages of the Transformer. GTPT alleviates the computational burden by gradually introducing keypoints in a coarse-to-fine manner, which minimizes computational overhead while maintaining high performance. In addition, GTPT groups keypoint tokens and prunes visual tokens to improve model performance while reducing redundancy. We propose Multi-Head Group Attention (MHGA) between different groups to achieve global interaction with little computational overhead. We conducted experiments on COCO and COCO-WholeBody. Compared to other methods, the experimental results show that GTPT achieves higher performance with less computation, especially in whole-body estimation with numerous keypoints.
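The visual-token pruning mentioned in the abstract can be illustrated with a minimal sketch. This page does not state the actual scoring rule, so the sketch assumes that each visual token is scored by the mean attention it receives from the keypoint tokens, and that a fixed fraction of the highest-scoring tokens is kept. The function name `prune_visual_tokens` and the `keep_ratio` parameter are hypothetical and are not taken from the paper.

```python
import math
from typing import List


def prune_visual_tokens(attn: List[List[float]], keep_ratio: float) -> List[int]:
    """Return the indices of the visual tokens to keep, in original order.

    attn[k][v] is the attention weight from keypoint token k to visual
    token v. Using mean attention as the score is an assumption made for
    illustration; it is not the paper's stated criterion.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not attn or not attn[0]:
        return []

    num_visual = len(attn[0])
    # Score each visual token by its mean attention over all keypoint tokens.
    scores = [sum(row[v] for row in attn) / len(attn) for v in range(num_visual)]

    # Keep at least one token so later layers always have visual context.
    num_keep = max(1, math.ceil(num_visual * keep_ratio))

    # Rank by score (highest first), take the top ones, then restore order.
    ranked = sorted(range(num_visual), key=lambda v: scores[v], reverse=True)
    return sorted(ranked[:num_keep])


# Two keypoint tokens attending to four visual tokens.
attn = [
    [0.10, 0.60, 0.05, 0.25],
    [0.20, 0.50, 0.10, 0.20],
]
kept = prune_visual_tokens(attn, keep_ratio=0.5)
```

Under this assumption, later layers attend only to the kept tokens, which is where the computational saving comes from. A group-specific variant would compute the scores separately from each group's keypoint tokens.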

Paper Structure (22 sections, 9 equations, 10 figures, 8 tables)

Introduction
Related Work
Transformer in HPE
Efficient HPE
Method
Tokenizer
Coarse Encoder
Coarse-to-Fine Transition
Fine Encoder
Global Perceived Pruning
Experiments
Settings
Results
Ablation Study
Limitations and Future Work
...and 7 more sections

Figures (10)

Figure 1: Comparison of our method with SOTA methods on COCO val. The horizontal axis indicates computation, the vertical axis indicates precision, and the circle size indicates the number of model parameters.
Figure 2: Overview of the introduction and grouping of keypoints. GTPT introduces keypoints in a coarse-to-fine manner. It starts with a human token and gradually transitions to sparse keypoint tokens and part tokens. Eventually, it converts part tokens into corresponding dense keypoint tokens. Besides, we categorize all keypoints into three groups: head, upper body, and lower body.
Figure 3: Overview of our proposed architecture. First, the Tokenizer, which consists of a shallow CNN, extracts features from the input image and transforms them into a sequence of tokens. Next, the Coarse Encoder, made of Transformer layers, gradually shifts feature extraction from the target human to sparse keypoints as the network depth increases. The Coarse-to-Fine Module is the transition between the Coarse Encoder and the Fine Encoder; it introduces dense keypoints and groups the keypoints. To extract better features for each group, it masks the visual tokens differently in different groups. The Fine Encoder then performs feature extraction on the keypoints of each group. Finally, all keypoint tokens are fed into the unified MLP Head to estimate the 1D heatmaps of each keypoint.
Figure 4: An example of the interaction between two groups via MHGA, where different patterns indicate different groups and red represents sharing.
Figure 5: Overview of the pruning process, where different patterns represent different visual tokens. The red indicates high scores, and the blue indicates low scores.
...and 5 more figures
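
The Figure 3 caption states that the MLP Head estimates 1D heatmaps for each keypoint. As a rough illustration of how such outputs could be turned into coordinates, the sketch below assumes one horizontal and one vertical 1D heatmap per keypoint and decodes each by taking the index of its maximum. The decoding rule, the function name `decode_1d_heatmaps`, and the `bins_per_pixel` parameter are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def decode_1d_heatmaps(
    x_maps: Sequence[Sequence[float]],
    y_maps: Sequence[Sequence[float]],
    bins_per_pixel: float = 1.0,
) -> List[Tuple[float, float]]:
    """Decode per-keypoint 1D heatmaps into (x, y) coordinates.

    x_maps[i] and y_maps[i] are the horizontal and vertical heatmaps of
    keypoint i. bins_per_pixel is a hypothetical resolution factor: a
    value of 2.0 means two heatmap bins cover one input pixel.
    """
    if len(x_maps) != len(y_maps):
        raise ValueError("x_maps and y_maps must cover the same keypoints")
    if bins_per_pixel <= 0:
        raise ValueError("bins_per_pixel must be positive")

    coords = []
    for x_map, y_map in zip(x_maps, y_maps):
        # Argmax along each axis gives the most likely bin for this keypoint.
        x_bin = max(range(len(x_map)), key=lambda i: x_map[i])
        y_bin = max(range(len(y_map)), key=lambda i: y_map[i])
        # Convert bin indices back to input-image pixel units.
        coords.append((x_bin / bins_per_pixel, y_bin / bins_per_pixel))
    return coords


# Two keypoints, three bins per axis.
x_maps = [[0.1, 0.7, 0.2], [0.0, 0.1, 0.9]]
y_maps = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]]
points = decode_1d_heatmaps(x_maps, y_maps)
```

With 1D outputs, the head predicts on the order of W + H values per keypoint instead of a W x H map, which fits the efficiency goal described above.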