TCPFormer: Learning Temporal Correlation with Implicit Pose Proxy for 3D Human Pose Estimation

Jiajie Liu, Mengyuan Liu, Hong Liu, Wenhao Li

TL;DR

TCPFormer addresses the difficulty of modeling complex temporal correlations in 2D pose sequences for 3D human pose estimation. It introduces an implicit pose proxy and three interaction modules (PUM, PIM, and PAM) that iteratively refine the proxy and fuse proxy-driven representations with the pose sequence. The method combines a spatio-temporal encoder, cross-attention-based proxy interactions, and a regression head trained with a combined loss $L = L_{3D} + \lambda L_T$ that targets both accuracy and temporal smoothness. It achieves state-of-the-art results on Human3.6M and MPI-INF-3DHP, with ablations validating the contribution of each component and the choice of proxy length. Overall, TCPFormer is a transformer-based framework that improves temporal modeling in 3D pose lifting and yields accurate, temporally coherent pose estimates.
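As a rough illustration of the combined loss $L = L_{3D} + \lambda L_T$ mentioned in the TL;DR, the sketch below assumes that $L_{3D}$ is a mean per-joint position error and that $L_T$ is an error on frame-to-frame joint displacements. Neither definition is given in this summary, and the function names and the default `lam` value are hypothetical.

```python
import numpy as np

def position_error(pred, target):
    """Assumed L_3D: mean Euclidean joint error over a (T, J, 3) sequence."""
    return np.linalg.norm(pred - target, axis=-1).mean()

def temporal_error(pred, target):
    """Assumed L_T: mean error between frame-to-frame joint displacements."""
    pred_vel = np.diff(pred, axis=0)      # (T-1, J, 3)
    target_vel = np.diff(target, axis=0)  # (T-1, J, 3)
    return np.linalg.norm(pred_vel - target_vel, axis=-1).mean()

def combined_loss(pred, target, lam=0.5):
    """L = L_3D + lam * L_T; lam=0.5 is a placeholder, not a reported value."""
    return position_error(pred, target) + lam * temporal_error(pred, target)
```

Under these assumptions, a prediction that is offset from the ground truth by a constant vector is penalized only by the position term, because its frame-to-frame motion matches the target exactly.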

Abstract

Recent multi-frame lifting methods have dominated 3D human pose estimation. However, previous methods ignore the intricate dependencies within the 2D pose sequence and learn only a single temporal correlation. To alleviate this limitation, we propose TCPFormer, which leverages an implicit pose proxy as an intermediate representation. Each proxy within the implicit pose proxy can build one temporal correlation, thereby helping us learn a more comprehensive temporal correlation of human motion. Specifically, our method consists of three key components: the Proxy Update Module (PUM), the Proxy Invocation Module (PIM), and the Proxy Attention Module (PAM). PUM first uses pose features to update the implicit pose proxy, enabling it to store representative information from the pose sequence. PIM then invokes the pose proxy and integrates it with the pose sequence to enhance the motion semantics of each pose. Finally, PAM leverages the resulting mapping between the pose sequence and the pose proxy to enhance the temporal correlation of the whole pose sequence. Experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that our proposed TCPFormer outperforms previous state-of-the-art methods.
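The proxy update and invocation steps described in the abstract can be sketched as two cross-attention passes. This is a minimal sketch under simplifying assumptions, not the authors' implementation: it uses a single head, omits learned projections, collapses the joint dimension into one feature vector per frame, and adds residual connections that the abstract does not specify. The sizes `T`, `L`, and `C` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head cross-attention without learned projections (assumption)."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))
    return attn @ keys_values, attn

rng = np.random.default_rng(0)
T, L, C = 27, 9, 16                 # hypothetical: frames, proxy length, channels
pose = rng.normal(size=(T, C))      # stand-in for spatio-temporal encoder output
proxy = rng.normal(size=(L, C))     # implicit pose proxy, Gaussian-initialized

# PUM-like step: each proxy queries the pose sequence and stores what it reads.
update, attn_proxy_to_pose = cross_attention(proxy, pose)    # (L, C), (L, T)
proxy = proxy + update              # residual update (assumption)

# PIM-like step: each pose queries the updated proxy to enrich its features.
enhance, attn_pose_to_proxy = cross_attention(pose, proxy)   # (T, C), (T, L)
pose = pose + enhance               # residual update (assumption)
```

The two attention matrices have shapes (L, T) and (T, L), which is what allows a later step to relate every frame to every other frame through the proxy.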

Paper Structure (17 sections, 13 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: An illustration of our motivation. Given a pose sequence of length T, we take the individual pose within the pose sequence as an example. (a) In previous methods, one pose establishes the temporal correlation with the pose sequence only in one 1-to-T mapping. (b) We introduce an implicit pose proxy to act as an intermediate representation. Each proxy within the implicit pose proxy of length L can establish one 1-to-T mapping, which facilitates learning more comprehensive temporal correlation.
  • Figure 2: Overview of our method. We first extract spatio-temporal information through a spatio-temporal encoder. Then, we introduce an implicit pose proxy, which is initialized from a Gaussian distribution. The features and the proxy are then passed to the proxy update module to update the implicit pose proxy. Next, the proxy invocation module uses the updated pose proxy to enhance the features of the pose sequence. We obtain an aggregation attention matrix from the two cross-attention matrices and send it, together with the pose sequence features, to the proxy attention module to learn a comprehensive temporal correlation. After repeating the above process N times, we use a regression head to obtain the 3D pose sequence.
  • Figure 3: Visualizations of different attention matrices. The first row is the original self-attention matrix. The second row is the aggregation attention matrix. The third row is our proxy attention matrix. As expected, our proxy attention matrix effectively leverages the aggregation attention matrix to complement the missing parts of the original self-attention matrix.
  • Figure 4: Qualitative comparisons of our TCPFormer with MotionBERT on in-the-wild videos. The yellow arrows indicate locations where our method achieves better results.
  • Figure 5: Qualitative comparisons of our TCPFormer with MotionBERT on Human3.6M. The green circles indicate locations where our method achieves better results.
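The captions for Figures 1 and 2 describe L proxies that each provide one 1-to-T mapping, and an aggregation attention matrix built from two cross-attention matrices. One plausible reading, assumed here and not confirmed by the captions, is that the aggregation matrix is the product of the pose-to-proxy and proxy-to-pose attention matrices. The sketch below uses random stand-in matrices and hypothetical sizes to show that such a product is a T-by-T attention matrix made up of one contribution per proxy.

```python
import numpy as np

def row_softmax(x):
    """Softmax over the last axis, so every row sums to 1."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, L = 6, 3  # hypothetical sequence length and proxy length

# Random stand-ins for the two cross-attention matrices.
attn_pose_to_proxy = row_softmax(rng.normal(size=(T, L)))  # pose -> proxy
attn_proxy_to_pose = row_softmax(rng.normal(size=(L, T)))  # proxy -> pose

# Assumed aggregation: chain the two mappings into a T-by-T matrix.
aggregation = attn_pose_to_proxy @ attn_proxy_to_pose

# Equivalent view: a sum of L rank-1 terms, one temporal mapping per proxy.
per_proxy_sum = sum(
    np.outer(attn_pose_to_proxy[:, l], attn_proxy_to_pose[l])
    for l in range(L)
)
```

Because both factors are row-stochastic, each row of `aggregation` is again a distribution over the T frames, so it can be compared with a standard self-attention matrix. How the proxy attention module actually combines the two is not specified in the captions and is left out of this sketch.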