Video-Based Human Pose Regression via Decoupled Space-Time Aggregation

Jijie He, Wenwu Yang

TL;DR

This work tackles video-based human pose regression by addressing the inefficiency of heatmap-based multi-frame methods. It introduces Decoupled Space-Time Aggregation (DSTA), which represents each joint as a dedicated feature token and decouples spatial and temporal dependencies via a Joint-centric Feature Decoder (JFD) and Space-Time Decoupling (STD) with local-awareness attention. By modeling temporal dynamics per joint rather than across the whole pose, DSTA achieves large gains over image-based regression (e.g., up to 8.9 mAP on PoseTrack2017) and is competitive with state-of-the-art heatmap-based methods, while reducing the computation of the prediction head to about 0.02 GFLOPs. Evaluations on PoseTrack2017/2018/2021 demonstrate strong performance and efficiency, highlighting DSTA's suitability for real-time and edge deployments in video pose estimation.

Abstract

By leveraging temporal dependency in video sequences, multi-frame human pose estimation algorithms have demonstrated remarkable results in complicated situations, such as occlusion, motion blur, and video defocus. These algorithms are predominantly based on heatmaps, resulting in high computation and storage requirements per frame, which limits their flexibility and real-time application in video scenarios, particularly on edge devices. In this paper, we develop an efficient and effective video-based human pose regression method, which bypasses intermediate representations such as heatmaps and instead directly maps the input to the output joint coordinates. Despite the inherent spatial correlation among adjacent joints of the human pose, the temporal trajectory of each individual joint exhibits relative independence. In light of this, we propose a novel Decoupled Space-Time Aggregation network (DSTA) to separately capture the spatial contexts between adjacent joints and the temporal cues of each individual joint, thereby avoiding the conflation of spatiotemporal dimensions. Concretely, DSTA learns a dedicated feature token for each joint to facilitate the modeling of their spatiotemporal dependencies. With the proposed joint-wise local-awareness attention mechanism, our method is capable of efficiently and flexibly utilizing the spatial dependency of adjacent joints and the temporal dependency of each joint itself. Extensive experiments demonstrate the superiority of our method. Compared to previous regression-based single-frame human pose estimation methods, DSTA significantly enhances performance, achieving an 8.9 mAP improvement on PoseTrack2017. Furthermore, our approach either surpasses or is on par with the state-of-the-art heatmap-based multi-frame human pose estimation methods. Project page: https://github.com/zgspose/DSTA.

Paper Structure

This paper contains 16 sections, 11 equations, 4 figures, and 10 tables.

Figures (4)

  • Figure 1: (a) Compared to our proposed video-based regression method, the previous image-based regression methods RLE [RLE_ICCV2021] and Poseur [Poseur_ECCV2022] suffer a substantial performance decline when processing video input, e.g., on the PoseTrack2017 dataset [PoseTrack2017_CVPR2017]. (b) Despite the intrinsic spatial correlations among human body joints, each joint follows an independent motion trajectory over time.
  • Figure 2: The pipeline of the proposed Decoupled Space-Time Aggregation (DSTA). The goal is to detect the human pose in the key frame $\mathcal{I}_i(t)$. Given a video sequence $\langle\mathcal{I}_i(t-T),\dots,\mathcal{I}_i(t),\dots,\mathcal{I}_i(t+T)\rangle$, DSTA uses a backbone network to extract a feature map for each frame. From these maps, the Joint-centric Feature Decoder (JFD) extracts feature tokens that represent each joint individually. Space-Time Decoupling (STD) then models the temporal dynamic dependencies and the spatial structural dependencies of the joints separately, producing aggregated space-time features for the current key frame. Each aggregated feature is used to regress the coordinates of the corresponding joint.
  • Figure 3: Qualitative comparison of (a) our DSTA, (b) DCPose [DCPose_CVPR2021], (c) Poseur [Poseur_ECCV2022], and (d) RLE [RLE_ICCV2021] on the PoseTrack datasets, featuring challenges such as occlusions, nearby-person interactions, and motion blur. Inaccurate predictions are marked with red solid circles.
  • Figure 4: Additional qualitative results of our DSTA on the PoseTrack datasets.
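
The decoupling described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes the per-joint tokens have already been extracted and are stored as an array of shape (frames, joints, channels), and it uses plain unparameterized dot-product attention. The function names, the adjacency mask used to stand in for "local-awareness", the concatenation of the two branches, and the linear regression head are all hypothetical choices made for illustration. The point it shows is structural: the temporal branch lets each joint look only at its own tokens across frames, while the spatial branch lets each joint look only at neighbouring joints in the key frame.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask is True where attending is allowed."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed pairs
    return softmax(scores, axis=-1) @ v

def decoupled_aggregate(tokens, adjacency):
    """tokens: (F, J, D) joint tokens; adjacency: (J, J) bool, self included.
    Returns (J, 2*D) features for the key (middle) frame."""
    num_frames = tokens.shape[0]
    key = num_frames // 2
    # Temporal branch: joint j attends only to joint j's tokens over all frames.
    per_joint = np.swapaxes(tokens, 0, 1)            # (J, F, D)
    query = per_joint[:, key:key + 1, :]             # (J, 1, D)
    temporal = attention(query, per_joint, per_joint)[:, 0, :]   # (J, D)
    # Spatial branch: within the key frame, attend only to adjacent joints.
    frame = tokens[key]                              # (J, D)
    spatial = attention(frame, frame, frame, mask=adjacency)     # (J, D)
    # Assumption: the two branches are fused by simple concatenation.
    return np.concatenate([temporal, spatial], axis=-1)

def regress_coordinates(features, weight, bias):
    """Hypothetical linear head mapping each joint feature to (x, y)."""
    return features @ weight + bias                  # (J, 2)

# Toy usage: 5 frames, 4 joints on a chain 0-1-2-3, 8 channels.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4, 8))
adjacency = np.eye(4, dtype=bool)
for a, b in [(0, 1), (1, 2), (2, 3)]:
    adjacency[a, b] = adjacency[b, a] = True
features = decoupled_aggregate(tokens, adjacency)
coords = regress_coordinates(features, rng.normal(size=(16, 2)), np.zeros(2))
```

In this sketch, perturbing one joint's token in a non-key frame changes only that joint's temporal feature and leaves every other joint untouched, which is the independence property the decoupled design relies on.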