Kinematics Modeling Network for Video-based Human Pose Estimation
Yonghao Dang, Jianqin Yin, Shaojie Zhang, Jiping Liu, Yanzhu Hu
TL;DR
This work tackles video-based human pose estimation by explicitly modeling temporal correlations between joints across frames. It introduces a kinematics modeling module (KMM) and the KIMNet architecture, which formulates pose estimation as a Markov decision process and locates joints recursively, guided by motion cues from all joints in the previous frame. The approach achieves state-of-the-art results on Penn Action and Sub-JHMDB and is robust to occlusion, while remaining compatible with existing pose frameworks. Ablation analyses confirm the value of joint-level temporal modeling, high-resolution feature propagation, and the attention-based KMM design. The method offers a practical pathway to more reliable pose tracking in challenging videos; the paper also discusses its limitations and avenues for incorporating priors in extreme cases.
Abstract
Estimating human poses from videos is critical in human-computer interaction. During human movement, joints cooperate rather than move independently, so there are both spatial and temporal correlations between joints. Although previous approaches have achieved positive results, most focus on modeling the spatial correlation between joints and integrate features along the temporal dimension only in a straightforward way, ignoring the temporal correlation between joints. In this work, we propose a plug-and-play kinematics modeling module (KMM) that explicitly models temporal correlations between joints across different frames by calculating their temporal similarity. In this way, KMM can capture motion cues of the current joint relative to all joints at different times. In addition, we formulate video-based human pose estimation as a Markov Decision Process and design a novel kinematics modeling network (KIMNet) to simulate the Markov Chain, allowing KIMNet to locate joints recursively. Our approach achieves state-of-the-art results on two challenging benchmarks. In particular, KIMNet shows robustness to occlusion. The code will be released at https://github.com/YHDang/KIMNet.
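
To make the idea of joint-level temporal similarity concrete, the following is a minimal sketch and not the released KIMNet code. It assumes each joint is represented by a small feature vector, that similarity is a scaled dot product normalized with a softmax, and that motion cues are fused with the current features by simple addition. The names `temporal_similarity`, `motion_cues`, and `track` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_similarity(curr_feats, prev_feats):
    """Row j holds the weights of current joint j over ALL previous-frame joints.

    Assumption: scaled dot-product similarity followed by a softmax.
    """
    scale = math.sqrt(len(curr_feats[0]))
    weights = []
    for q in curr_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in prev_feats]
        weights.append(softmax(scores))
    return weights


def motion_cues(curr_feats, prev_feats):
    """Similarity-weighted sum of previous-frame joint features, one cue per joint."""
    weights = temporal_similarity(curr_feats, prev_feats)
    dim = len(prev_feats[0])
    return [
        [sum(w * k[d] for w, k in zip(row, prev_feats)) for d in range(dim)]
        for row in weights
    ]


def track(frames_feats):
    """Recursive pass over a clip: each state depends only on the previous state.

    Assumption: cues are fused with the current features by addition.
    """
    state = frames_feats[0]
    states = [state]
    for feats in frames_feats[1:]:
        cues = motion_cues(feats, state)
        state = [[f + c for f, c in zip(fj, cj)] for fj, cj in zip(feats, cues)]
        states.append(state)
    return states


# Toy clip: 3 frames, 3 joints, 2-dimensional features per joint.
prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
clip_states = track([prev, curr, curr])
```

Each row of the similarity matrix sums to one, so every current joint receives a convex combination of all previous-frame joint features. The loop in `track` reflects the Markov-chain view only in the sense that the state at frame t is computed from the state at frame t-1 and the current observation; the actual network operates on learned feature maps rather than toy vectors.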
