Spectral Compression Transformer with Line Pose Graph for Monocular 3D Human Pose Estimation

Zenghao Zheng, Lianping Yang, Hegui Zhu, Mingrui Ye

TL;DR

This work tackles the high computational cost and frame redundancy of transformer-based monocular 3D human pose estimation by introducing the Spectral Compression Transformer (SCT), which compresses hidden features between transformer blocks using a DCT-based low-pass filter parameterized by $\sigma$. It also introduces the Line Pose Graph (LPG), which augments 2D pose priors with bone-centered coordinates derived from a line-graph formulation, enhancing topology-aware information. A dual-stream network architecture combines SCT and LPG in a way that progressively down-samples the temporal dimension while preserving the ability to recover full sequences via interpolation, achieving a reported MPJPE of $37.7$ mm on Human3.6M and strong results on MPI-INF-3DHP with reduced computational cost. The approach demonstrates that spectral compression of hidden features and bone-aware priors can substantially improve efficiency without sacrificing accuracy, and it remains compatible with other 3D HPE backbones.

Abstract

Transformer-based 3D human pose estimation methods suffer from high computational costs due to the quadratic complexity of self-attention with respect to sequence length. Additionally, pose sequences often contain significant redundancy between frames. However, recent methods typically fail to improve model capacity while effectively eliminating sequence redundancy. In this work, we introduce the Spectral Compression Transformer (SCT) to reduce sequence length and accelerate computation. The SCT encoder treats hidden features between blocks as Temporal Feature Signals (TFS) and applies the Discrete Cosine Transform (DCT), a Fourier-related transform, to determine the spectral components to be retained. By filtering out certain high-frequency noise components, SCT compresses the sequence length and reduces redundancy. To further enrich the input sequence with prior structural information, we propose the Line Pose Graph (LPG) based on line graph theory. The LPG generates skeletal position information that complements the input 2D joint positions, thereby improving the model's performance. Finally, we design a dual-stream network architecture to effectively model spatial joint relationships and the compressed motion trajectory within the pose sequence. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that our model achieves state-of-the-art performance with improved computational efficiency. For example, on the Human3.6M dataset, our method achieves an MPJPE of 37.7 mm while maintaining a low computational cost. Furthermore, we perform ablation studies on each module to assess its effectiveness. The code and models will be released.
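The idea of shortening a temporal feature signal by discarding high-frequency DCT components can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a hard truncation that keeps the first `keep` coefficients (the abstract does not specify the exact filter), operates on a single feature channel, and uses hypothetical function names (`dct`, `idct`, `spectral_compress`).

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1D list of floats."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(
            v * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
            for t, v in enumerate(signal)))
    return out


def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for t in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
        out.append(total)
    return out


def spectral_compress(signal, keep):
    """Shorten a length-F signal to `keep` samples by dropping high frequencies.

    Assumption: hard truncation of the DCT spectrum; the paper's actual
    filter may differ.
    """
    f = len(signal)
    if not 0 < keep <= f:
        raise ValueError("keep must be in 1..len(signal)")
    low = dct(signal)[:keep]        # retain only the lowest frequencies
    gain = math.sqrt(keep / f)      # keeps amplitudes comparable across lengths
    return idct([c * gain for c in low])


# One feature channel over 8 frames, compressed to 4 frames.
compressed = spectral_compress([0.0, 0.1, 0.4, 0.9, 1.6, 2.5, 3.6, 4.9], 4)
print(len(compressed))
```

With the `gain` rescaling, a constant signal stays constant after compression, and `keep == len(signal)` returns the input unchanged up to floating-point error. In a transformer, the same operation would be applied along the temporal axis of every hidden channel, so the following attention blocks see a shorter sequence.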

Paper Structure

This paper contains 24 sections, 18 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: This process involves compressing and upsampling the hidden feature sequence using SCT. Here, $F$ denotes the original sequence frame length. The proposed dual-stream structure progressively downsamples the hidden feature sequence in length while upsampling the output of each layer to reconstruct the hidden pose sequence. The reconstructed sequences from all layers are aggregated to produce the final hidden pose sequence.
  • Figure 2: TFS and its associated frequency-domain plots from MixSTE's third block. In the time-domain subfigures (a) and (d), the horizontal axis represents time and the vertical axis denotes the signal's magnitude. In the frequency-domain subfigures (b) and (c), the horizontal axis shows the frequency components of the TFS and the vertical axis displays the power spectral density of the signal at each frequency. The signal trends remain largely consistent before and after truncation, while some jagged noise components are reduced.
  • Figure 3:
  • Figure 4: A case of transformation from an original graph $G$ to its line graph $L(G)$.
  • Figure 5: Illustration of the transformation from the original pose to the line pose. Yellow dots represent joints, and yellow lines connecting these joints represent bones. Post-transformation, the red dots represent the midpoints of the bones, and in the line pose, there are edges between the vertices corresponding to bones that share a common joint in the original pose.
  • ...and 6 more figures
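
The line-pose transformation described in the captions for Figures 4 and 5 (bones become vertices placed at their midpoints, and two such vertices are connected when the bones share a joint) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `line_pose` and the toy four-joint skeleton are hypothetical and do not correspond to the Human3.6M joint layout.

```python
def line_pose(joints, bones):
    """Build a line pose from a 2D pose.

    joints: list of (x, y) joint coordinates.
    bones:  list of (i, j) joint-index pairs, one per bone.
    Returns (midpoints, edges): one midpoint per bone, and the line-graph
    edges (a, b) between bones that share a joint.
    """
    midpoints = [
        ((joints[i][0] + joints[j][0]) / 2.0,
         (joints[i][1] + joints[j][1]) / 2.0)
        for i, j in bones
    ]
    edges = [
        (a, b)
        for a in range(len(bones))
        for b in range(a + 1, len(bones))
        if set(bones[a]) & set(bones[b])  # the two bones share a joint
    ]
    return midpoints, edges


# Toy skeleton (hypothetical): a trunk joint with two branches.
toy_joints = [(0.0, 0.0), (0.0, 2.0), (2.0, 2.0), (-2.0, 2.0)]
toy_bones = [(0, 1), (1, 2), (1, 3)]
toy_midpoints, toy_edges = line_pose(toy_joints, toy_bones)
print(toy_midpoints)
print(toy_edges)
```

In this toy case all three bones meet at joint 1, so every pair of bone vertices is connected in the line graph. The midpoint coordinates are the kind of bone-centered information that can be concatenated with the 2D joint positions as an additional input prior.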