Kinematics Modeling Network for Video-based Human Pose Estimation

Yonghao Dang, Jianqin Yin, Shaojie Zhang, Jiping Liu, Yanzhu Hu

TL;DR

This work tackles video-based human pose estimation by explicitly modeling temporal correlations between joints across frames. It introduces the Kinematics Modeling Module (KMM) and a KIMNet architecture that formulates pose estimation as a Markov decision process, enabling recursive joint localization guided by motion cues from all joints in the previous frame. The approach achieves state-of-the-art results on Penn Action and Sub-JHMDB and shows robustness to occlusion, while remaining compatible with existing pose frameworks. Ablation analyses confirm the value of joint-level temporal modeling, high-resolution feature propagation, and the attention-based KMM design. The method offers a practical pathway to more reliable pose tracking in challenging videos, with limitations discussed and avenues for incorporating priors in extreme cases.

Abstract

Estimating human poses from videos is critical in human-computer interaction. Joints cooperate rather than move independently during human movement, so there are both spatial and temporal correlations between joints. Despite the positive results of previous approaches, most focus on modeling the spatial correlation between joints while only straightforwardly integrating features along the temporal dimension, ignoring the temporal correlation between joints. In this work, we propose a plug-and-play kinematics modeling module (KMM) to explicitly model temporal correlations between joints across different frames by calculating their temporal similarity. In this way, KMM can capture motion cues of the current joint relative to all joints at different times. In addition, we formulate video-based human pose estimation as a Markov Decision Process and design a novel kinematics modeling network (KIMNet) to simulate the Markov chain, allowing KIMNet to locate joints recursively. Our approach achieves state-of-the-art results on two challenging benchmarks. In particular, KIMNet shows robustness to occlusion. The code will be released at https://github.com/YHDang/KIMNet.
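
The idea of computing temporal similarity between joints and then locating joints recursively can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `temporal_joint_attention` and `recursive_estimate`, the use of scaled dot-product similarity with a softmax, and the additive fusion of motion cues are all assumptions chosen for illustration. Per-joint features are represented as plain lists of floats.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def temporal_joint_attention(prev_feats, curr_feats):
    """Similarity of each joint at frame t to every joint at frame t-1.

    prev_feats, curr_feats: J feature vectors of dimension D each.
    Returns (attn, cues): attn[i][j] is the weight of previous-frame joint j
    for current joint i; cues[i] is the weighted sum of previous-frame
    features. Scaled dot-product similarity is an assumption of this sketch.
    """
    d = len(curr_feats[0])
    attn = [
        softmax([dot(q, k) / math.sqrt(d) for k in prev_feats])
        for q in curr_feats
    ]
    cues = [
        [sum(w * k[c] for w, k in zip(row, prev_feats)) for c in range(d)]
        for row in attn
    ]
    return attn, cues


def recursive_estimate(frame_feats):
    """Process frames in order; each state depends only on the previous state.

    frame_feats: list of T frames, each a list of J feature vectors.
    Additive fusion of cues and current features is an illustrative choice.
    """
    state = frame_feats[0]
    outputs = [state]
    for feats in frame_feats[1:]:
        _, cues = temporal_joint_attention(state, feats)
        state = [[f + c for f, c in zip(fr, cr)] for fr, cr in zip(feats, cues)]
        outputs.append(state)
    return outputs
```

Each row of `attn` sums to one, so it can be read as how strongly a joint in the current frame depends on each joint in the previous frame. The loop in `recursive_estimate` mirrors the Markov property described above: the state for frame t is computed from frame t's features and the state of frame t-1 only.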

Paper Structure (31 sections, 14 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Comparisons between our approach and other existing methods. (a) LSTM-based methods. (b) Optical flow-based approaches. (c) CNN-based methods. (d) The proposed KIMNet.
  • Figure 2: Overview of the proposed kinematics modeling network.
  • Figure 3: The structure of the proposed kinematics modeling module.
  • Figure 4: Visualization of the temporal correlation between joints. The size of joints at frame $t$ represents the degree of temporal dependency between joints.
  • Figure 5: Outputs of the proposed KMM and KIMNet. The visualization results from left to right are the original input frames, ground truth, outputs of the KMM, and outputs of the KIMNet.
  • ...and 4 more figures