Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation

Hongwei Fang; Jiahang Cai; Xun Wang; Wenwu Yang

TL;DR

TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. A global restoring attention (GRA) then restores the aggregated temporal features back into the token sequence of the current frame.

Abstract

Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus. In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. To effectively aggregate joint-specific features that are temporally aligned across frames, we introduce a joint-centric temporal aggregation (JTA) that assigns each joint a learnable query token to selectively attend to its corresponding regions from neighboring frames. Furthermore, we develop a global restoring attention (GRA) to restore the aggregated temporal features back into the token sequence of the current frame, enriching its pose representation while fully preserving global context for precise keypoint localization. Extensive experiments demonstrate that TAR-ViTPose substantially improves upon the single-frame baseline ViTPose, achieving a +2.3 mAP gain on the PoseTrack2017 benchmark. Moreover, our approach outperforms existing state-of-the-art video-based methods, while also achieving a noticeably higher runtime frame rate in real-world applications. Project page: https://github.com/zgspose/TARViTPose.
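
The aggregate-and-restore idea described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`attend`, `aggregate_and_restore`), the absence of learned projection matrices, the single attention head, and the residual addition in the restore step are all assumptions made for illustration. Joint queries first attend to tokens from every frame in the window (aggregation), and the current frame's tokens then attend to the aggregated joint features (restoration).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attend(queries, keys, values):
    """Scaled dot-product attention: returns one output vector per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs

def aggregate_and_restore(joint_queries, frame_tokens, current_index):
    """joint_queries: J vectors; frame_tokens: list of frames, each a list of N token vectors."""
    # Aggregate: every joint query attends to the tokens of all frames in the window.
    all_tokens = [tok for frame in frame_tokens for tok in frame]
    aggregated = attend(joint_queries, all_tokens, all_tokens)
    # Restore: current-frame tokens attend to the aggregated joint features.
    # Adding the result back residually is an assumption of this sketch.
    current = frame_tokens[current_index]
    restored = attend(current, aggregated, aggregated)
    return [[c + r for c, r in zip(tok, res)] for tok, res in zip(current, restored)]

# Toy window of three frames, two tokens per frame, feature dimension 2.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three hypothetical joint queries
enhanced = aggregate_and_restore(queries, frames, current_index=1)
```

The output keeps the token count and feature dimension of the current frame, so in this sketch it could be passed to the same decoder that consumes the single-frame features, which is the sense in which the temporal module is plug-and-play.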

Paper Structure (17 sections, 6 equations, 6 figures, 14 tables)

Introduction
Related Work
Image-Based Human Pose Estimation
Video-Based Human Pose Estimation
Our Approach
ViTPose Revisited and Beyond
Joint-centric Temporal Aggregation
Mask-aware Feature-to-Joint Attention
Global Restoring Attention
Experiments
Experimental Settings
Main Results
Comparison with the ViTPose Baseline
Comparison with State-of-the-Art Methods
Ablation Study
...and 2 more sections

Figures (6)

Figure 1: Comparison between the baseline ViTPose pipeline (a) and our TAR-ViTPose (b). (a) ViTPose adopts a ViT encoder to extract latent features from the input image, which are then fed into a lightweight decoder to regress keypoint heatmaps. (b) Our method enhances the current-frame representation by aggregating temporal cues from adjacent frames, achieving plug-and-play temporal modeling within the original ViTPose architecture.
Figure 2: The pipeline of the proposed Temporal Aggregate-and-Restore Vision Transformer (TAR-ViTPose). The objective is to estimate the human pose of the current frame $X_i(t)$. Given a video sequence $\langle X_i(t\!-\!T), \dots, X_i(t), \dots, X_i(t\!+\!T) \rangle$, each frame is first encoded by the ViT encoder to obtain latent features $\{ F_i^{\text{out}}(\tau) \}_{\tau = t-T}^{t+T}$. The JTA precisely aligns and aggregates keypoint features across frames. To achieve this, a query token is assigned to each joint ($Q$), and a mask-aware attention selectively attends to its corresponding joint regions in neighboring frames, guided by masks $\{ M(\tau) \}_{\tau = t-T}^{t+T}$ derived from the decoded keypoint heatmaps $\{ \overline{H}(\tau) \}_{\tau = t-T}^{t+T}$. Subsequently, the GRA injects the aggregated temporal features $\widetilde{Q}$ back into the current frame's latent representation, producing an enhanced feature $\widehat{F}_i^{\text{out}}(t)$, which is then fed into the lightweight decoder to generate the final keypoint heatmaps $H_i(t)$ for the current frame.
Figure 3: Qualitative comparison of (a) our TAR-ViTPose, (b) ViTPose [Vitpose_NIPS2022], (c) DCPose [DCPose_CVPR2021], (d) DSTA [DSTA-CVPR2024], and (e) Poseidon [Poseidon-Arxiv2025] on examples featuring challenges such as occlusion, motion blur, and defocus. The first two columns are from the PoseTrack datasets, while the last two columns are from in-the-wild videos. Inaccurate predictions are marked with red solid circles. Zoom in for clarity.
Figure 4: Visualization of attention heatmaps for joint query tokens with (b) and without (a) mask-aware attention. Given a current frame $X_i(t)$ and a neighboring frame $X_i(t-T)$, we visualize the attention heatmaps of three different joint query tokens with respect to the features of $X_i(t-T)$. See Supp. Material for more.
Figure 5: Additional qualitative results of our TAR-ViTPose. The first two rows are from the PoseTrack datasets, while the last row is from in-the-wild videos. Across all examples, the model remains robust under occlusion, motion blur, complex poses, and defocus.
...and 1 more figure
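
The mask-aware attention mentioned in the captions of Figures 2 and 4 can be sketched as follows. The captions state only that the masks are derived from decoded keypoint heatmaps. The specific rule used here (keep patches whose response is at least half of the heatmap peak), the fallback to unmasked attention when no patch survives, and all function names are assumptions for illustration. The heatmap is assumed to be non-negative and already flattened to one value per patch token.

```python
import math

def heatmap_to_mask(heatmap, ratio=0.5):
    """Keep patches whose response is at least `ratio` of the peak (assumed rule)."""
    peak = max(heatmap)
    return [value >= ratio * peak for value in heatmap]

def masked_softmax(scores, mask):
    """Softmax over positions where mask is True; masked positions get weight 0."""
    if not any(mask):
        # Assumed fallback: with an empty mask, attend over all positions.
        mask = [True] * len(scores)
    peak = max(s for s, m in zip(scores, mask) if m)
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def mask_aware_attention(query, tokens, heatmap):
    """One joint query attends to a neighboring frame's tokens inside the mask."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    weights = masked_softmax(scores, heatmap_to_mask(heatmap))
    dim = len(tokens[0])
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights

# Four patch tokens from a hypothetical neighboring frame and one joint's heatmap.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
heatmap = [0.1, 0.9, 0.8, 0.0]
pooled, weights = mask_aware_attention([1.0, 0.0], tokens, heatmap)
```

In this toy case only the second and third patches pass the threshold, so the first and fourth receive exactly zero attention weight regardless of how well they match the query. This mirrors the contrast shown in Figure 4 between attention with and without the mask.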
