SHaRPose: Sparse High-Resolution Representation for Human Pose Estimation

Xiaoqi An; Lin Zhao; Chen Gong; Nannan Wang; Di Wang; Jian Yang

TL;DR

SHaRPose introduces a sparse high-resolution representation for human pose estimation, reducing computation by applying fine-grained processing only to image regions relevant to keypoints. The method is a two-stage dynamic Transformer with a shared keypoint decoder. The coarse stage gathers region-keypoint relations and outputs coarse heatmaps; a quality predictor then decides whether the coarse result needs refinement; if it does, the fine stage constructs sparse high-resolution representations for the selected patches and produces refined pose estimates. The approach achieves competitive COCO performance (e.g., 77.4 AP on val and 76.7 AP on test-dev for SHaRPose-Base) with substantially higher throughput (≈1.4x faster than ViTPose-Base) and reduced GFLOPs (≈25% less). Ablation studies confirm the effectiveness of the coarse-to-fine design and the importance of the alpha sparsity parameter and the quality predictor, and visualizations show attention concentrating on keypoint regions.

Abstract

High-resolution representation is essential for achieving good performance in human pose estimation models. To obtain such features, existing works utilize high-resolution input images or fine-grained image tokens. However, this dense high-resolution representation brings a significant computational burden. In this paper, we address the following question: "Since only sparse human keypoint locations are detected in human pose estimation, is it really necessary to describe the whole image in a dense, high-resolution manner?" Based on dynamic transformer models, we propose a framework that only uses Sparse High-resolution Representations for human Pose estimation (SHaRPose). In detail, SHaRPose consists of two stages. At the coarse stage, the relations between image regions and keypoints are dynamically mined while a coarse estimation is generated. Then, a quality predictor is applied to decide whether the coarse estimation results should be refined. At the fine stage, SHaRPose builds sparse high-resolution representations only on the regions related to the keypoints and provides refined high-precision human pose estimations. Extensive experiments demonstrate the outstanding performance of the proposed method. Specifically, compared to the state-of-the-art method ViTPose, our model SHaRPose-Base achieves 77.4 AP (+0.5 AP) on the COCO validation set and 76.7 AP (+0.5 AP) on the COCO test-dev set, and runs inference $1.4\times$ faster than ViTPose-Base.
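
The coarse-to-fine control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `two_stage_inference`, the callable signatures, the `threshold` parameter, and the convention that a higher quality score means "good enough, skip refinement" are all assumptions made for the example.

```python
from typing import Any, Callable, List, Tuple

Pose = List[Tuple[float, float]]  # one (x, y) location per keypoint


def two_stage_inference(
    image: Any,
    coarse_stage: Callable[[Any], Tuple[Pose, Any]],  # image -> (coarse pose, context)
    quality_predictor: Callable[[Any], float],        # context -> score (assumed in [0, 1])
    fine_stage: Callable[[Any, Any], Pose],           # (image, context) -> refined pose
    threshold: float = 0.5,                           # hypothetical cutoff, not from the paper
) -> Tuple[Pose, str]:
    """Run the coarse stage, then refine only if the predicted quality is low."""
    coarse_pose, context = coarse_stage(image)
    score = quality_predictor(context)
    if score >= threshold:
        # Coarse estimate judged sufficient: exit early and skip the fine stage.
        return coarse_pose, "coarse"
    # Otherwise reuse the coarse-stage context (e.g. attention) in the fine stage.
    return fine_stage(image, context), "fine"
```

Under these assumptions, the compute saving comes from two places: inputs that exit after the coarse stage never pay for the fine stage, and those that continue only process the keypoint-related regions at high resolution.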

Paper Structure (32 sections, 11 equations, 6 figures, 8 tables)


Introduction
Related Works
Vision Transformer for Pose Estimation
Dynamic Vision Transformer
Method
Overall structure
Coarse-inference stage
Token Input
Transformer encoder
Keypoint Decoder
Quality Predictor
Fine-inference stage
Keypoint-Related Patch Recognition
Fine inference
Loss Function
...and 17 more sections

Figures (6)

Figure 1: A brief view of SHaRPose. The coarse stage selects the image parts that contribute to the keypoints, and the fine stage builds high-resolution representations upon them.
Figure 2: Decoder's response of ViTPose. Each heatmap is generated by feeding the output of each intermediate Transformer layer to the heatmap decoder.
Figure 3: The overall structure of SHaRPose. The attention maps yielded by the Transformer in the coarse stage are used for selecting keypoint-related patches in the fine stage. Only these keypoint-related patches are processed at a finer granularity in the fine stage. The parameters of the Transformer blocks and the keypoint decoder are shared between the two stages.
Figure 4: Compose the input of the fine stage. The attention scores $\hat{\mathbf{A}}_{h;k}$ between visual tokens and keypoint tokens are just part of the full attention matrix $\mathbf{A}_{h;k}$. Only high-score image patches (blue) are further split into fine-grained patches. An MLP is applied to incorporate the coarse-stage information into the fine stage.
Figure 5: Visualization of keypoint-related regions. Three samples are chosen as examples. The first column gives the input image, the second column presents the accumulated attention map, and the third column shows the selected image regions.
...and 1 more figures
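
The patch selection described in the captions of Figures 4 and 5 (accumulate the attention between keypoint tokens and visual tokens, keep the high-score patches, split those into finer patches) can be sketched as below. This is an illustration under stated assumptions, not the paper's code: the function names, the plain sum over heads and keypoint tokens, the reading of `alpha` as the fraction of coarse patches to keep, and the 2x2 split factor are all hypothetical choices for the example.

```python
import math
from typing import List, Tuple


def select_keypoint_patches(attn: List[List[List[float]]], alpha: float) -> List[int]:
    """Return indices of the coarse patches with the highest accumulated attention.

    attn[h][k][n] is the attention from keypoint token k to visual token n
    in head h (the keypoint-to-patch part of the full attention matrix).
    """
    num_patches = len(attn[0][0])
    scores = [0.0] * num_patches
    for head in attn:
        for keypoint_row in head:
            for n, a in enumerate(keypoint_row):
                scores[n] += a  # accumulate over heads and keypoint tokens
    keep = max(1, math.ceil(alpha * num_patches))  # assumed meaning of alpha
    ranked = sorted(range(num_patches), key=lambda n: scores[n], reverse=True)
    return sorted(ranked[:keep])


def split_patch(index: int, grid_w: int, factor: int = 2) -> List[Tuple[int, int]]:
    """Map a coarse patch index to the (row, col) cells of its fine-grained sub-patches."""
    row, col = divmod(index, grid_w)
    return [
        (row * factor + dr, col * factor + dc)
        for dr in range(factor)
        for dc in range(factor)
    ]


# Toy example: 1 head, 2 keypoint tokens, a 2x2 grid of coarse patches.
attn = [[[0.1, 0.6, 0.2, 0.1], [0.05, 0.15, 0.7, 0.1]]]
selected = select_keypoint_patches(attn, alpha=0.5)  # -> [1, 2]
fine_cells = [cell for i in selected for cell in split_patch(i, grid_w=2)]
```

In the toy example, half of the coarse patches are kept, so the fine stage would see 8 fine-grained cells instead of the 16 a dense high-resolution grid would require.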
