End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer

Yonghui Yu, Jiahang Cai, Xun Wang, Wenwu Yang

TL;DR

The paper tackles end-to-end multi-person 2D pose estimation in videos, proposing PAVE-Net to replace detector-based pipelines and heuristic postprocessing. It combines a spatial encoder, a pose-aware spatiotemporal decoder, and a joint refinement stage, aggregating temporal features so that each pose query follows the same individual across frames. Training uses a set-based Hungarian loss and a residual log-likelihood pose regression loss. PAVE-Net achieves a 6.0 mAP improvement over prior image-based end-to-end methods on PoseTrack2017 and performs competitively with state-of-the-art two-stage video methods. The approach also offers substantial efficiency gains, with near-constant inference time as scene complexity grows, making it attractive for real-time, multi-person video applications.
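
The set-based loss mentioned above depends on a one-to-one matching between predicted and ground-truth poses. A minimal sketch of that matching step follows, under stated assumptions: the cost is the mean L1 distance between keypoints (this summary does not give the paper's actual matching cost), the names `pose_cost` and `match_poses` are hypothetical, and a brute-force search over permutations stands in for the Hungarian algorithm, which reaches the same optimum in polynomial time.

```python
from itertools import permutations


def pose_cost(pred, gt):
    """Mean L1 distance between two poses given as lists of (x, y) keypoints."""
    total = sum(abs(px - gx) + abs(py - gy) for (px, py), (gx, gy) in zip(pred, gt))
    return total / len(gt)


def match_poses(preds, gts):
    """Return (pred_idx, gt_idx) pairs minimising the total matching cost.

    Assumes len(preds) >= len(gts), as in query-based models where the
    number of queries exceeds the number of people. Brute force is used
    for clarity (an assumption of this sketch); it is only practical for
    a handful of poses.
    """
    best_cost, best_pairs = float("inf"), []
    for chosen in permutations(range(len(preds)), len(gts)):
        cost = sum(pose_cost(preds[p], gts[g]) for g, p in enumerate(chosen))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(chosen)]
    return best_pairs, best_cost


# Three predicted poses (two keypoints each) and two ground-truth poses.
preds = [
    [(10.0, 10.0), (12.0, 20.0)],
    [(50.0, 50.0), (52.0, 60.0)],
    [(0.0, 0.0), (1.0, 1.0)],
]
gts = [
    [(51.0, 50.0), (52.0, 61.0)],
    [(10.0, 11.0), (12.0, 20.0)],
]
pairs, total = match_poses(preds, gts)
```

Predictions left unmatched (here the third one) would be treated as "no person" when the loss is computed; the regression loss would apply only to the matched pairs.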

Abstract

Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames. Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation. Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a 6.0 mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches, while offering significant gains in efficiency. Project page: https://github.com/zgspose/PAVENet.
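
To make the idea of selective aggregation concrete, here is a toy sketch of attention restricted by a reference pose. It is an illustration under assumptions, not PAVE-Net's mechanism: the abstract does not specify how attention is made pose-aware, so the distance-based mask, the `radius` parameter, and the function name `pose_masked_attention` are all hypothetical.

```python
import math


def pose_masked_attention(query, tokens, positions, ref_keypoints, radius):
    """Softmax attention over tokens lying within `radius` of a reference keypoint.

    query: feature vector (list of floats) for one pose query.
    tokens: list of feature vectors, one per token, pooled over frames.
    positions: (x, y) location of each token.
    ref_keypoints: (x, y) keypoints of the reference pose (assumed input).
    Returns the attention-weighted sum of the kept tokens.
    """
    def near(pos):
        return any(math.dist(pos, kp) <= radius for kp in ref_keypoints)

    kept = [i for i, pos in enumerate(positions) if near(pos)]
    if not kept:
        return [0.0] * len(query)

    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tokens[i])) / scale for i in kept]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    return [
        sum(w * tokens[i][d] for w, i in zip(weights, kept))
        for d in range(len(query))
    ]


# Two tokens near the reference person, one token on a different person.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 5.0]]
positions = [(10.0, 10.0), (11.0, 10.0), (80.0, 80.0)]
out = pose_masked_attention([1.0, 1.0], tokens, positions, [(10.0, 10.0)], radius=5.0)
```

In this toy setup the third token has the largest dot product with the query, yet it contributes nothing because it lies far from the reference pose. That is the behaviour the abstract describes: features from other individuals are excluded from the aggregation.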

Paper Structure

This paper contains 14 sections, 8 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Comparison of two-stage and end-to-end frameworks for video-based MMPE. (a) To predict 2D poses in the current frame, existing methods (PoseWarper, NIPS 2019; DCPose, CVPR 2021; OTPose, SMC 2022; TDMI, CVPR 2023; DiffPose, ICCV 2023; DSTA, CVPR 2024) first crop regions from consecutive frames for each human instance and then input them into a temporal model to perform single-person pose estimation (SPPE). (b) PAVE-Net achieves end-to-end video-based MMPE with a spatial encoder and spatiotemporal decoder.
  • Figure 2: (a) Video transformer baseline. (b) Pose-aware video transformer (PAVE-Net) architecture. The goal is to detect all human poses in the current frame $F(t)$ by leveraging temporal dynamics from a sequence of consecutive frames $\langle F(t-T), \dots, F(t), \dots, F(t+T) \rangle$. PAVE-Net employs a backbone network to extract multi-scale features from each frame, which are transformed into feature tokens. A Spatial Encoder (SE) processes each frame independently to capture local dependencies within its tokens. The Spatiotemporal Pose Decoder (STPD) then models global dependencies between pose queries and feature tokens across all frames, using the top $M$ highest-confidence poses regressed from the feature tokens of the current frame $t$ as references. This enables accurate prediction of 2D poses for frame $t$, which are further refined by a joint decoder.
  • Figure 3: Qualitative comparison of our PAVE-Net, PETR (CVPR 2022), DSTA (CVPR 2024), and DCPose (CVPR 2021), highlighting challenges such as occlusions, motion blur, and crowded scenarios. The top two rows are from the PoseTrack dataset, while the bottom two rows are from in-the-wild videos. Inaccurate predictions are marked with red solid circles. Better viewed with zoom.
  • Figure 4: Features attended by the query token of the central target person across consecutive frames, highlighted with colored circles. While our pose-aware attention focuses exclusively on the target person's features (a), features from other individuals are mistakenly attended without it (b). Best viewed with zoom.
  • Figure 5: Additional qualitative results of our method on the PoseTrack validation sets and in-the-wild videos.
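
The Figure 2 caption describes two bookkeeping steps that are easy to sketch: building the frame window $\langle F(t-T), \dots, F(t+T) \rangle$ around the current frame, and keeping the top $M$ highest-confidence poses as references for the decoder. The sketch below is illustrative only. Clamping indices at the video boundaries is an assumption (the caption does not say how borders are handled), and the names `frame_window` and `top_m_references` are hypothetical.

```python
def frame_window(t, T, num_frames):
    """Indices of frames t-T..t+T, clamped to the valid range [0, num_frames - 1]."""
    return [min(max(i, 0), num_frames - 1) for i in range(t - T, t + T + 1)]


def top_m_references(poses, scores, m):
    """Return the m poses with the highest confidence, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [poses[i] for i in order[:m]]


# At the first frame of a 10-frame clip, missing past frames repeat frame 0.
window = frame_window(t=0, T=2, num_frames=10)

# Keep the two most confident candidate poses as references.
refs = top_m_references(["pose_a", "pose_b", "pose_c"], [0.2, 0.9, 0.5], m=2)
```

Because $M$ is fixed, the number of pose queries does not grow with the number of people in the scene, which is consistent with the near-constant inference time mentioned in the TL;DR.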