Refined Temporal Pyramidal Compression-and-Amplification Transformer for 3D Human Pose Estimation

Hanbing Liu, Wangmeng Xiang, Jun-Yan He, Zhi-Qi Cheng, Bin Luo, Yifeng Geng, Xuansong Xie

TL;DR

The paper addresses 3D human pose estimation in video by introducing the RTPCA transformer, which strengthens temporal modeling through Temporal Pyramidal Compression-and-Amplification (TPCA) attention and improves inter-block information flow with a Cross-Layer Refinement (XLR) module. TPCA refines key and value representations via a multi-stage pyramidal process, enabling multi-scale temporal information extraction within attention, while XLR promotes dynamic interaction between adjacent transformer blocks. The approach achieves state-of-the-art results on Human3.6M, HumanEva-I, and MPI-INF-3DHP with minimal computational overhead and improved robustness to noise and occlusion. These findings suggest that structured intra-block temporal refinement combined with cross-layer inter-block fusion can significantly enhance transformer-based 3D pose estimation in real-world video settings.

Abstract

Estimating the 3D pose of humans in video sequences requires both accuracy and a well-structured architecture. Building on the success of transformers, we introduce the Refined Temporal Pyramidal Compression-and-Amplification (RTPCA) transformer. Exploiting the temporal dimension, RTPCA extends intra-block temporal modeling via its Temporal Pyramidal Compression-and-Amplification (TPCA) structure and refines inter-block feature interaction with a Cross-Layer Refinement (XLR) module. In particular, the TPCA block exploits a temporal pyramid paradigm, reinforcing key and value representation capabilities and seamlessly extracting spatial semantics from motion sequences. We stitch these TPCA blocks together with XLR, which promotes rich semantic representation through continuous interaction of queries, keys, and values. This strategy combines early-stage information with current flows, addressing typical deficits in detail and stability seen in other transformer-based methods. We demonstrate the effectiveness of RTPCA by achieving state-of-the-art results on the Human3.6M, HumanEva-I, and MPI-INF-3DHP benchmarks with minimal computational overhead. The source code is available at https://github.com/hbing-l/RTPCA.
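
To make the idea of pyramidal key/value refinement concrete, the following is a minimal NumPy sketch, not the authors' implementation (see their repository for that). The abstract only states that keys and values are refined through a temporal pyramid before attention. Everything else here is an assumption made for illustration: average pooling as the compression step, frame repetition as the amplification step, averaging the scales together, the number of stages, and the hypothetical names `compress`, `amplify`, `pyramid_refine`, and `tpca_like_attention`. It operates on a single joint's sequence of shape (F, C).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress(x, factor):
    # ASSUMPTION: compression = average pooling over the frame axis (axis 0).
    f, c = x.shape
    if f % factor != 0:
        raise ValueError("number of frames must be divisible by the pooling factor")
    return x.reshape(f // factor, factor, c).mean(axis=1)

def amplify(x, factor):
    # ASSUMPTION: amplification = repeating each pooled frame `factor` times.
    return np.repeat(x, factor, axis=0)

def pyramid_refine(x, stages=2):
    # Hypothetical refinement: average the input with coarser temporal
    # summaries at scales 2, 4, ..., 2**stages.
    out = x.copy()
    for s in range(1, stages + 1):
        factor = 2 ** s
        out = out + amplify(compress(x, factor), factor)
    return out / (stages + 1)

def tpca_like_attention(q, k, v, stages=2):
    # Standard scaled dot-product attention, but with keys and values
    # replaced by their multi-scale refined versions.
    k_ref = pyramid_refine(k, stages)
    v_ref = pyramid_refine(v, stages)
    scores = q @ k_ref.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v_ref

# Usage: 8 frames, 4 channels (illustrative sizes only).
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
out = tpca_like_attention(q, k, v, stages=2)
print(out.shape)
```

In this sketch the queries stay at full temporal resolution while keys and values carry information from several temporal scales, which is one plausible reading of "multi-scale temporal information extraction within attention". The XLR module is not modeled here.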

Paper Structure

This paper contains 17 sections, 5 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Comparison of methods on the Human3.6M dataset in terms of MPJPE and latency. Points closer to the origin are better. Our method (RTPCA) surpasses the others in both accuracy and efficiency.
  • Figure 2: Framework of the proposed Refined Temporal Pyramidal Compression-and-Amplification (RTPCA) transformer. The network is formed by stacking TPCA modules to extract multi-scale information in attention. The Cross-Layer Refinement (XLR) module is proposed to fuse inter-block information. The idea is to combine keys and values from both the front and the back for feature aggregation, thereby boosting the capability of the transformer. The input feature has dimensions $B\times J \times F \times C$, where $B$ denotes the batch size, $F$ is the number of frames, $J$ is the number of joints and $C$ represents the channel size.
  • Figure 3: Comparison of 3D human poses estimated by different methods. The figure shows 3D reconstructions from our method and the SOTA method MixSTE, alongside the ground truth and the corresponding video frames from the Human3.6M dataset. Our method shows higher accuracy and robustness across various actions and occlusion scenarios.
  • Figure 4: MPJPE-frame curves for MixSTE and our method. Frame-wise MPJPE is compared on the Photo action of the Human3.6M test set; our method outperforms MixSTE with higher accuracy and stability.
  • Figure 5: Attention visualization for our method and MixSTE. The first row shows the results of our method and the second row those of MixSTE. Our method learns more comprehensive attention, which demonstrates its effectiveness at aggregating information, whereas the original ST tends to focus on certain frames.
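
Figures 1 and 4 report MPJPE (mean per-joint position error), the mean Euclidean distance between predicted and ground-truth 3D joint positions. A minimal sketch of the overall and frame-wise variants is given below. The function names, the (F, J, 3) array layout, and the 17-joint example are assumptions chosen for illustration and are not taken from the paper's code.

```python
import numpy as np

def mpjpe_per_frame(pred, gt):
    # pred, gt: arrays of shape (F, J, 3) with 3D joint positions.
    # Returns one value per frame: the mean joint distance in that frame.
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)

def mpjpe(pred, gt):
    # Mean over all frames and joints.
    return float(mpjpe_per_frame(pred, gt).mean())

# Usage: every joint is displaced by (3, 4, 0), so each joint error is 5.
gt = np.zeros((2, 17, 3))            # 2 frames, 17 joints (illustrative)
pred = gt + np.array([3.0, 4.0, 0.0])
print(mpjpe_per_frame(pred, gt))     # one error value per frame
print(mpjpe(pred, gt))
```

A curve like the one in Figure 4 corresponds to plotting the per-frame values against the frame index, while a single reported score corresponds to the overall mean.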