KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation

Jihua Peng; Yanghong Zhou; P. Y. Mok

TL;DR

KTPFormer introduces two prior-attention modules, Kinematics Prior Attention (KPA) and Trajectory Prior Attention (TPA), that inject kinematic and trajectory priors into transformer-based 3D human pose estimation. KPA builds a learnable kinematic topology $A_K$ from the fixed skeletal links plus a learnable global affinity, and fuses it into the spatial tokens via $H_{TN} = (M_N \bar{P}_{TN}) A_K$. TPA encodes joint-motion trajectories across frames through a learnable trajectory topology $A_R$, yielding the tokens $H_{NT}$ used in the temporal MHSA. The architecture stacks spatio-temporal encoders and uses a regression head to predict 3D poses, trained with a loss that combines weighted MPJPE, temporal-consistency, and velocity terms. On Human3.6M, MPI-INF-3DHP, and HumanEva, KTPFormer achieves state-of-the-art results with only modest increases in parameters and FLOPs. The KPA and TPA modules are lightweight and transfer readily to other transformer-based 3D pose estimators, including diffusion-based models, so they strengthen spatial and temporal modeling while remaining compatible with existing architectures.
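The KPA formula above can be read as two steps: lift each 2D joint to $d_m$ dimensions, then mix the joints through $A_K$. The NumPy sketch below illustrates that reading only. It rests on several assumptions that are not taken from the paper: $\bar{P}_{TN}$ is treated as the per-frame pose arranged as a $2 \times N$ matrix, $M_N$ as a single shared $d_m \times 2$ lifting matrix, and $A_K$ as the plain sum of the fixed and learnable parts. The 5-joint chain skeleton and the function names are hypothetical.

```python
import numpy as np

# Hypothetical 5-joint chain skeleton (NOT the Human3.6M skeleton).
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]


def local_topology(num_joints, edges):
    """Fixed symmetric adjacency with self-loops, built from skeletal links."""
    a = np.eye(num_joints)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    return a


def kpa_tokens(pose_2d, lift, a_local, a_global):
    """Sketch of H = (M P_bar) A_K applied to every frame.

    pose_2d  : (T, N, 2) input 2D poses
    lift     : (d_m, 2) lifting matrix standing in for M_N (assumed shape)
    a_local  : (N, N) fixed skeleton adjacency
    a_global : (N, N) learnable global affinity (a plain array here)
    returns  : (T, N, d_m) spatial tokens
    """
    a_k = a_local + a_global                  # kinematics topology A_K (assumed sum)
    p_bar = np.transpose(pose_2d, (0, 2, 1))  # (T, 2, N)
    lifted = lift @ p_bar                     # (T, d_m, N), lift broadcast over frames
    h = lifted @ a_k                          # mix joints through A_K
    return np.transpose(h, (0, 2, 1))         # back to (T, N, d_m)


rng = np.random.default_rng(0)
pose = rng.normal(size=(3, 5, 2))             # T=3 frames, N=5 joints
lift = rng.normal(size=(8, 2))                # d_m=8
a_loc = local_topology(5, EDGES)
a_glob = 0.01 * rng.normal(size=(5, 5))       # stand-in for a trained parameter
tokens = kpa_tokens(pose, lift, a_loc, a_glob)
print(tokens.shape)
```

With the global term set to zero, each joint's token reduces to the sum of the lifted features of that joint and its skeletal neighbours, which is the sense in which the fixed links act as a prior before any attention is computed.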

Abstract

This paper presents a novel Kinematics and Trajectory Prior Knowledge-Enhanced Transformer (KTPFormer), which overcomes a weakness of existing transformer-based methods for 3D human pose estimation: the Q, K, V vectors in their self-attention mechanisms are all derived by simple linear mapping. We propose two prior attention modules, namely Kinematics Prior Attention (KPA) and Trajectory Prior Attention (TPA), to take advantage of the known anatomical structure of the human body and motion trajectory information, and to facilitate effective learning of global dependencies and features in the multi-head self-attention. KPA models kinematic relationships in the human body by constructing a topology of kinematics, while TPA builds a trajectory topology to learn the information of joint motion trajectory across frames. By yielding Q, K, V vectors with prior knowledge, the two modules enable KTPFormer to model both spatial and temporal correlations simultaneously. Extensive experiments on three benchmarks (Human3.6M, MPI-INF-3DHP and HumanEva) show that KTPFormer achieves superior performance in comparison to state-of-the-art methods. More importantly, our KPA and TPA modules have lightweight plug-and-play designs and can be integrated into various transformer-based networks (e.g., diffusion-based) to improve performance with only a very small increase in computational overhead. The code is available at: https://github.com/JihuaPeng/KTPFormer.

Paper Structure (13 sections, 14 equations, 3 figures, 7 tables)

Introduction
Related Work
Method
Kinematics-Enhanced Transformer
Trajectory-Enhanced Transformer
Stacked Spatio-Temporal Encoders
Regression Head
Experiments
Datasets and Protocols
Comparison with State-of-the-art Methods
Ablation Study
Qualitative Analysis and Discussion
Conclusion

Figures (3)

Figure 1: Top: the spatial local topology (fixed) plus the simulated spatial global topology (learnable) to form the kinematics topology (learnable). Bottom: the temporal local topology (fixed) plus the simulated temporal global topology (learnable) to form the joint motion trajectory topology (learnable).
Figure 2: Overview of the Kinematics and Trajectory Prior Knowledge-Enhanced Transformer (KTPFormer). The input 2D pose sequence $P_{TN} \in \mathbb{R}^{T \times N \times 2}$ with $T$ frames and $N$ joints is first fed into the Kinematics-Enhanced Transformer. KPA injects kinematic information into $P_{TN}$ to obtain high-dimensional spatial tokens $H_{TN} \in \mathbb{R}^{T \times N \times d_{m}}$. $H_{TN}$ is then split into $Q_{S}$, $K_{S}$, $V_{S}$, which are fed into the Spatial MHSA. The Trajectory-Enhanced Transformer takes a sequence of reshaped tokens $P_{NT} \in \mathbb{R}^{N \times T \times d_{m}}$ as input. The stacked TPA blocks with a residual connection yield the temporal tokens $H_{NT} \in \mathbb{R}^{N \times T \times d_{m}}$, which are then sliced into $Q_{T}$, $K_{T}$, $V_{T}$ for the Temporal MHSA.
Figure 3: Comparison of visualization results and attention maps between our method and MixSTE (Zhang et al., 2022). The x-axis and y-axis correspond to the queries and the predicted outputs, respectively.
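The temporal half of Figure 1 and the TPA step in Figure 2 can be illustrated with a small NumPy sketch. It is a reading of the captions under stated assumptions, not the authors' implementation: the fixed temporal local topology is assumed to link each frame to itself and its immediate neighbours, $A_R$ is assumed to be the plain sum of the fixed and learnable parts, and one TPA block is assumed to compute tokens plus $A_R$ applied along the frame axis. The function names are hypothetical, and the slicing into $Q_{T}$, $K_{T}$, $V_{T}$ is left out because the captions do not specify it.

```python
import numpy as np


def temporal_local_topology(num_frames):
    """Fixed T x T adjacency: each frame linked to itself and adjacent frames (assumed)."""
    a = np.eye(num_frames)
    idx = np.arange(num_frames - 1)
    a[idx, idx + 1] = 1.0
    a[idx + 1, idx] = 1.0
    return a


def to_temporal_layout(h_tn):
    """Reshape spatial tokens (T, N, d_m) into the temporal layout (N, T, d_m)."""
    return np.transpose(h_tn, (1, 0, 2))


def tpa_block(tokens, a_local, a_global):
    """One TPA-style step with a residual connection (assumed form).

    tokens   : (N, T, d_m) per-joint token sequences
    a_local  : (T, T) fixed temporal adjacency
    a_global : (T, T) learnable global temporal affinity (a plain array here)
    returns  : (N, T, d_m) temporal tokens
    """
    a_r = a_local + a_global        # trajectory topology A_R (assumed sum)
    return tokens + a_r @ tokens    # (T, T) @ (N, T, d_m) broadcasts over joints


rng = np.random.default_rng(0)
h_tn = rng.normal(size=(4, 5, 8))               # T=4 frames, N=5 joints, d_m=8
p_nt = to_temporal_layout(h_tn)                 # (5, 4, 8)
a_loc = temporal_local_topology(4)
a_glob = 0.01 * rng.normal(size=(4, 4))         # stand-in for a trained parameter
h_nt = tpa_block(p_nt, a_loc, a_glob)
print(h_nt.shape)
```

Because the topology acts along the frame axis of each joint independently, every joint's token at frame $t$ aggregates that joint's own tokens from nearby frames, which is what makes the prior a trajectory prior rather than a skeletal one.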
