HRPVT: High-Resolution Pyramid Vision Transformer for medium and small-scale human pose estimation

Zhoujie Xu

TL;DR

The paper tackles medium- and small-scale human pose estimation by marrying Vision Transformer backbones with CNN-inspired high-resolution processing. It introduces HRPVT, built on PVT v2 and SimCC, and a High-Resolution Pyramid Module (HRPM) to inject scale-invariance and locality into high-resolution maps. Two insertion strategies (Layer-wise and Stage-wise) adapt HRPM to baselines of varying capacity, achieving superior accuracy with substantially fewer parameters and GFLOPs on MS COCO and competitive results on MPII. The study demonstrates the practical impact of combining transformer power with CNN inductive biases for efficient, accurate pose estimation in challenging, small-scale scenarios.

Abstract

Estimating human poses at medium and small scales has long been a significant challenge. Most existing methods restore high-resolution feature maps either by stacking multiple costly deconvolutional layers or by continuously aggregating semantic information from low-resolution feature maps while maintaining high-resolution ones, which can lead to information redundancy. In addition, because of quantization errors, heatmap-based methods are at a disadvantage when locating the keypoints of medium- and small-scale human figures accurately. In this paper, we propose HRPVT, which uses PVT v2 as the backbone to model long-range dependencies. Building on this, we introduce the High-Resolution Pyramid Module (HRPM), designed to generate higher-quality high-resolution representations by incorporating the intrinsic inductive biases of Convolutional Neural Networks (CNNs) into the high-resolution feature maps. Integrating HRPM improves the performance of pure transformer-based models for human pose estimation at medium and small scales. Furthermore, we replace the heatmap-based method with the SimCC approach, which eliminates the need for costly upsampling layers and thereby lets us allocate more computational resources to HRPM. To accommodate models with varying parameter scales, we develop two insertion strategies for HRPM, each designed to enhance the model's ability to perceive medium- and small-scale human poses from a distinct perspective.
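
The quantization issue mentioned in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it models only the coordinate grid on which each decoding scheme can place a keypoint. The heatmap stride of 4, the SimCC splitting factor k = 2, and all function names are illustrative assumptions.

```python
import math


def heatmap_decode(x_true: float, stride: int = 4) -> float:
    """Snap a coordinate to the nearest cell of a heatmap downsampled by
    `stride`, then map it back to input pixels (stride is an assumption)."""
    cell = math.floor(x_true / stride + 0.5)  # nearest heatmap cell
    return cell * stride


def simcc_decode(x_true: float, k: float = 2.0) -> float:
    """Snap a coordinate to the nearest 1D classification bin of width 1/k
    pixels, as in SimCC-style decoding (k = 2 is an assumption)."""
    bin_index = math.floor(x_true * k + 0.5)  # nearest 1D bin
    return bin_index / k


def worst_case_error_heatmap(stride: int) -> float:
    """Largest possible rounding error of the heatmap grid, in input pixels."""
    return stride / 2


def worst_case_error_simcc(k: float) -> float:
    """Largest possible rounding error of the 1D bin grid, in input pixels."""
    return 1 / (2 * k)


if __name__ == "__main__":
    x = 13.3  # hypothetical ground-truth x coordinate in input pixels
    print("heatmap:", heatmap_decode(x), "simcc:", simcc_decode(x))
    print("worst case:", worst_case_error_heatmap(4), worst_case_error_simcc(2.0))
```

Under these assumptions the worst-case rounding error falls from 2 pixels to 0.25 pixels. A fixed pixel error is a larger fraction of a small person's extent than of a large one's, which is consistent with the abstract's claim that quantization hurts medium and small scales most.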

Paper Structure

This paper contains 29 sections, 13 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The pipeline of HRPVT. Given an input image, a human detector is first applied to generate a set of human bounding boxes. Subsequently, the backbone of HRPVT is used to extract keypoint representations of the human body. Finally, the 1D Coordinate Classifier, SimCC, predicts the detailed localization of keypoints for each individual.
  • Figure 2: Illustration of the structure of HRPM v1: In the stem-net, given a cropped image of size H×W×3, hierarchical hybrid-dilated convolutions are first applied to obtain six feature maps with different receptive fields. These feature maps are then stacked through a concatenation operation to form a high-resolution feature pyramid of size H/2×W/2×96. Subsequently, a strided convolutional layer is used to fuse and compress the features while downsampling the feature maps to reduce computational load in subsequent stages. Finally, the feature maps are reshaped into a token sequence to serve as input for the following stages.
  • Figure 3: Illustration of the structure of HRPM v2: In stage 2, after passing through all the PVT v2 encoder layers, the token sequence is first reshaped back into a feature map of size H/4×W/4×$C_1$, where $C_1$ is the number of channels in the first stage. This feature map is then upsampled to H/2×W/2×$C_1$/2 using a deconvolutional layer. Next, it passes through a hierarchical hybrid-dilated convolution structure with a depth of 3, and the high-resolution features are aggregated by element-wise addition to model the high-resolution pyramid structure again. Finally, a strided convolutional layer downsamples the feature map back to its original size. Throughout the entire process, a residual branch reuses the learned features from the previous stage.
  • Figure 4: Illustration of the three insertion methods for HRPM v2: Vanilla Insertion involves inserting only after the first stage. Layer-wise Insertion refers to inserting after each PVT v2 encoder layer within the first stage. Stage-wise Insertion involves inserting after each stage.
  • Figure 5: Ablation study on three insertion methods and their variants. 'w/o' indicates the absence of HRPM, 'w/vanilla' refers to using Vanilla Insertion, 'w/layer-wise' and 'w/stage-wise' represent the use of the Layer-wise Insertion strategy and Stage-wise Insertion strategy respectively, and 'variant' denotes a variant of one of these two insertion strategies.
  • ...and 1 more figures
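
The tensor shapes described in the captions of Figures 2 and 3 can be traced with a short sketch. It only does shape bookkeeping and is not the authors' code. The 256×192 input crop, $C_1$ = 64, the equal split of the 96 pyramid channels across the six branches, and the function names are assumptions made for illustration. H and W are assumed to be divisible by 4.

```python
def hrpm_v1_shapes(h: int, w: int, c1: int = 64,
                   branches: int = 6, pyramid_channels: int = 96) -> dict:
    """Trace the (height, width, channels) shapes from the Figure 2 caption."""
    # Assumption: the 96 pyramid channels are split evenly over the branches.
    per_branch = pyramid_channels // branches
    pyramid = (h // 2, w // 2, pyramid_channels)  # concatenated feature pyramid
    fused = (h // 4, w // 4, c1)                  # after the strided convolution
    tokens = (fused[0] * fused[1], c1)            # reshaped into a token sequence
    return {"per_branch_channels": per_branch, "pyramid": pyramid,
            "fused": fused, "tokens": tokens}


def hrpm_v2_shapes(h: int, w: int, c1: int = 64) -> dict:
    """Trace the (height, width, channels) shapes from the Figure 3 caption."""
    stage_map = (h // 4, w // 4, c1)        # tokens reshaped back to a map
    upsampled = (h // 2, w // 2, c1 // 2)   # after the deconvolutional layer
    restored = (h // 4, w // 4, c1)         # strided conv returns to input size
    return {"stage_map": stage_map, "upsampled": upsampled, "restored": restored}


if __name__ == "__main__":
    # Hypothetical 256x192 person crop.
    for name, shape in hrpm_v1_shapes(256, 192).items():
        print("v1", name, shape)
    for name, shape in hrpm_v2_shapes(256, 192).items():
        print("v2", name, shape)
```

With these assumed values, the HRPM v1 pyramid is 128×96×96 and the token sequence has 64·48 = 3072 entries of width 64. HRPM v2 temporarily doubles the spatial resolution while halving the channel count.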