Transformers with Joint Tokens and Local-Global Attention for Efficient Human Pose Estimation

Kaleab A. Kinfu, René Vidal

TL;DR

The paper tackles the persistent trade-off among accuracy, efficiency, and robustness in 2D human pose estimation by introducing two Vision Transformer-based models, EViTPose and UniTransPose, and a unified skeletal representation for cross-dataset training. EViTPose gains efficiency through patch selection driven by learnable joint tokens, reducing GFLOPs by $30\%$ to $44\%$ with a minimal accuracy loss ($0\%$ to $3.5\%$) across six benchmarks. UniTransPose combines a multi-scale encoder with Joint Aware Global-Local (JAGL) attention and a sub-pixel CNN decoder to improve both speed and accuracy, with accuracy gains of $0.9\%$ to $43.8\%$ across datasets. The unified skeletal representation enables training on multiple datasets with differing joint annotations, enhancing generalization and robustness to pose variations, occlusions, and lighting conditions. Collectively, the methods deliver state-of-the-art accuracy-efficiency-robustness trade-offs and offer flexible decoding options (heat-map and regression) to suit diverse application needs.

Abstract

Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have led to significant progress in 2D body pose estimation. However, achieving a good balance between accuracy, efficiency, and robustness remains a challenge. For instance, CNNs are computationally efficient but struggle with long-range dependencies, while ViTs excel in capturing such dependencies but suffer from quadratic computational complexity. This paper proposes two ViT-based models for accurate, efficient, and robust 2D pose estimation. The first one, EViTPose, operates in a computationally efficient manner without sacrificing accuracy by utilizing learnable joint tokens to select and process a subset of the most important body patches, enabling us to control the trade-off between accuracy and efficiency by changing the number of patches to be processed. The second one, UniTransPose, while not allowing for the same level of direct control over the trade-off, efficiently handles multiple scales by combining (1) an efficient multi-scale transformer encoder that uses both local and global attention with (2) an efficient sub-pixel CNN decoder for better speed and accuracy. Moreover, by incorporating all joints from different benchmarks into a unified skeletal representation, we train robust methods that learn from multiple datasets simultaneously and perform well across a range of scenarios -- including pose variations, lighting conditions, and occlusions. Experiments on six benchmarks demonstrate that the proposed methods significantly outperform state-of-the-art methods while improving computational efficiency. EViTPose exhibits a significant decrease in computational complexity (30% to 44% less in GFLOPs) with a minimal drop of accuracy (0% to 3.5% less), and UniTransPose achieves accuracy improvements ranging from 0.9% to 43.8% across these benchmarks.
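
The joint-token-driven patch selection described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `select_patches`, the scaled dot-product scoring, and the rule that a patch's importance is the largest attention weight any joint token assigns to it are all assumptions made for illustration. The sketch returns both the kept and the dropped patch indices, since the dropped patches skip later blocks but are still used by the heat-map decoder.

```python
import numpy as np

def select_patches(patch_tokens, joint_tokens, keep):
    """Keep the `keep` patches that receive the most attention from joint tokens.

    patch_tokens: (N, D) array of patch embeddings.
    joint_tokens: (J, D) array of learnable joint embeddings.
    Returns (kept_indices, dropped_indices), each sorted in ascending order.
    """
    dim = patch_tokens.shape[1]
    # Scaled dot-product scores between every joint token and every patch: (J, N).
    logits = joint_tokens @ patch_tokens.T / np.sqrt(dim)
    # Numerically stable softmax over the patch axis, one row per joint.
    logits = logits - logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)
    # Assumption: a patch matters as much as the joint that attends to it most.
    importance = attn.max(axis=0)
    order = np.argsort(-importance, kind="stable")
    return np.sort(order[:keep]), np.sort(order[keep:])

# Hypothetical sizes: 192 patches, 17 joints, 64-dimensional tokens.
rng = np.random.default_rng(0)
patches = rng.normal(size=(192, 64))
joints = rng.normal(size=(17, 64))
kept, dropped = select_patches(patches, joints, keep=128)
```

Changing `keep` is the knob that trades accuracy for computation: fewer kept patches means fewer tokens in every subsequent transformer block.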

Paper Structure

This paper contains 33 sections, 17 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overall architecture of EViTPose: ViT-based human pose estimation method with patch selection -- An image is passed through a patch embedding layer to obtain patches of size $16\times 16$. These patches, along with $J$ learnable joint tokens, are processed by a ViT with $L$ transformer blocks. Utilizing the joint tokens, the patch selection module progressively selects patches that more likely contain the most important information about body joints across all blocks except the last one. The non-selected patches are not processed by the subsequent blocks but are utilized in the heat-map decoder. The output of the final ViT block is then used by a CNN-based heat-map decoder to estimate the heat map of $J$ joints, while a simple MLP joint regressor estimates joints directly from the joint tokens.
  • Figure 2: Overall architecture of UniTransPose, a multi-scale vision transformer based human pose estimation network -- An input image $X \in \mathbb{R}^{H\times W\times 3}$ is fed into a patch embedding layer that divides the image into patches of size $4\times 4$. A linear embedding layer then projects the patch tokens to a $C$-dimensional vector. The patch tokens along with joint tokens are processed by four stages. Each stage comprises Joint-Aware Global Local (JAGL) attention blocks, which consist of local patch-to-patch attention followed by global patch-to-joint, joint-to-joint, and joint-to-patch attention. A convolution layer ($3 \times 3$, stride $2$) is applied between stages to reduce the spatial resolution of the patch tokens and generate a hierarchical feature map. This operation also doubles the patch tokens' channel dimension. Consequently, to maintain consistency, a linear embedding layer is used to double the channel dimension of the joint tokens. The output of each stage is then passed to a CNN-based decoder to estimate the heat map of the $J$ joints. Meanwhile, the key-point regressor uses the joint tokens to directly estimate the $(x,y)$ locations of the $J$ joints.
  • Figure 3: Distinct annotation styles across multiple benchmarks. (a) COCO and OCHuman share a common 17-joint skeleton. (b) JRDB uses the same number of joints but differs in locations. (c) MPII employs 16 joints. (d) AIChallenger and CrowdPose use 14 joints. (e) The proposed Unified skeleton comprises all joints present in the various benchmarks.
  • Figure 4: Runtime (FPS) vs GFLOPs comparison -- The Joint-Token-based Patch Selection method (EViTPose-B/JT) achieves an 88% reduction in GFLOPs and a 10$\times$ (955%) increase in FPS compared to ViTPose-H, with a minimal accuracy drop of up to 2.9%.
  • Figure 5: Trade-off between accuracy and GFLOPs on three benchmarks: COCO, MPII, and OCHuman -- The performance of EViTPose-B with two patch selection methods: Neighbors (dashed line) and Joint-Token-based (solid line). $n$ denotes the number of neighbors selected, and $b,p$ indicates that $p$ patches are removed at block $b$ in the Joint Tokens method.
  • ...and 1 more figures
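
The multi-scale schedule in the Figure 2 caption ($4\times 4$ patches, then a $3\times 3$ stride-2 convolution and channel doubling between each of the four stages) can be traced with a short calculation. This is only a sketch: the helper name `stage_shapes`, the convolution padding of 1, the $256\times 192$ input size, and the embedding width $C = 96$ are assumptions chosen for illustration and are not stated in the text.

```python
def stage_shapes(height, width, embed_dim, patch=4, stages=4):
    """Return (rows, cols, tokens, channels) of the patch grid at each stage."""
    rows, cols, channels = height // patch, width // patch, embed_dim
    shapes = []
    for stage in range(stages):
        if stage > 0:
            # 3x3 convolution with stride 2; padding of 1 is assumed here.
            rows = (rows + 2 * 1 - 3) // 2 + 1
            cols = (cols + 2 * 1 - 3) // 2 + 1
            # The channel dimension doubles at every stage transition.
            channels *= 2
        shapes.append((rows, cols, rows * cols, channels))
    return shapes

for index, shape in enumerate(stage_shapes(256, 192, embed_dim=96), start=1):
    print(f"stage {index}: grid {shape[0]}x{shape[1]}, "
          f"{shape[2]} patch tokens, {shape[3]} channels")
# stage 1: grid 64x48, 3072 patch tokens, 96 channels
# stage 2: grid 32x24, 768 patch tokens, 192 channels
# stage 3: grid 16x12, 192 patch tokens, 384 channels
# stage 4: grid 8x6, 48 patch tokens, 768 channels
```

Under these assumptions the number of patch tokens drops by a factor of four at each transition. The joint tokens keep their count $J$ and only change width, which is why the caption mentions a separate linear layer to double their channel dimension.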