Joint-Motion Mutual Learning for Pose Estimation in Videos

Sifan Wu; Haipeng Chen; Yifang Yin; Sihao Hu; Runyang Feng; Yingying Jiao; Ziqi Yang; Zhenguang Liu

TL;DR

This work tackles human pose estimation in videos, where video defocus and self-occlusion degrade performance. It introduces JM-Pose, a framework that models local joint dependencies with a context-aware joint learner and global pixel-level motion with a progressive joint-motion mutual learning module, and adds an information orthogonality objective to encourage diverse, non-redundant cues. The approach reports state-of-the-art results on PoseTrack2017, PoseTrack2018, and PoseTrack21, with gains in challenging scenes and on key joints that indicate robustness to complex video dynamics. The mutual learning strategy offers a principled way to fuse heatmap-derived cues and motion flow for pose estimation in real-world video data.

Abstract

Human pose estimation in videos has long been a compelling yet challenging task in computer vision. The task remains difficult because of complex video scenes involving, for example, video defocus and self-occlusion. Recent methods integrate multi-frame visual features generated by a backbone network for pose estimation. However, they often ignore the useful joint information encoded in the initial heatmap, which is a by-product of the backbone. Conversely, methods that refine the initial heatmap fail to consider spatio-temporal motion features. As a result, existing methods fall short because they cannot leverage both local joint (heatmap) information and global motion (feature) dynamics. To address this problem, we propose a novel joint-motion mutual learning framework for pose estimation, which attends to both local joint dependency and global pixel-level motion dynamics. Specifically, we introduce a context-aware joint learner that adaptively leverages initial heatmaps and motion flow to retrieve robust local joint features. Given that local joint features and global motion flow are complementary, we further propose progressive joint-motion mutual learning, which exchanges information between joint features and motion flow and learns them interactively to improve the capability of the model. More importantly, to capture more diverse joint and motion cues, we theoretically analyze and propose an information orthogonality objective that avoids learning redundant information from multiple cues. Experiments show that our method outperforms prior art on three challenging benchmarks.
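The abstract does not give the form of the information orthogonality objective, so the following is only an illustrative stand-in, not the paper's formulation. It sketches one simple way to penalize redundancy between two feature vectors, using squared cosine similarity: the penalty is 0 when the vectors are orthogonal and 1 when they are parallel. The function names and the choice of cosine similarity are assumptions made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def redundancy_penalty(joint_feature, motion_feature):
    """Hypothetical penalty: squared cosine similarity between the two cues.

    This is an assumed stand-in for a redundancy-reducing term; it is not
    the information orthogonality objective defined in the paper.
    """
    return cosine_similarity(joint_feature, motion_feature) ** 2

if __name__ == "__main__":
    print(redundancy_penalty([1.0, 0.0], [0.0, 1.0]))  # orthogonal cues
    print(redundancy_penalty([1.0, 2.0], [2.0, 4.0]))  # parallel (redundant) cues
```

Minimizing a term of this kind alongside the main task loss would push the joint and motion representations toward carrying different information, which is the intuition the abstract describes.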

Paper Structure (16 sections, 11 equations, 5 figures, 6 tables)

This paper contains 16 sections, 11 equations, 5 figures, 6 tables.

Introduction
Related Work
Image-based Human Pose Estimation
Video-based Human Pose Estimation
Method
Context-Aware Joint Learner
Joint-Motion Mutual Learning
Loss Functions
Experiments
Experimental Settings
Quantitative Comparison with State-of-the-art Methods (RQ1)
Performance on Complex and Challenging Scenes (RQ2)
Comparison of Visual Results (RQ3)
Ablation Study (RQ4)
Conclusion
...and 1 more sections

Figures (5)

Figure 1: Heatmap-based methods such as DCPose refine heatmaps to perform pose estimation but cannot incorporate spatio-temporal features. Conversely, feature-based methods like TDMI employ feature differences to extract valuable features but neglect the local joint dependency present in heatmaps. Both kinds of methods may struggle in scenarios involving video defocus and self-occlusion. In comparison, our approach models joint-motion information with a novel mutual learning framework, which captures more meaningful and complementary representations and delivers a more robust result.
Figure 2: JM-Pose is designed to estimate human pose in the keyframe $I_t$ with its consecutive supporting frames, e.g., $I_{t-\delta}, I_{t+\delta}$ in the figure. JM-Pose introduces two key components: Context-aware Joint Learner and Joint-Motion Mutual Learning. The context-aware joint learner is designed to extract the local joint-level feature $J_t$ from motion flow $M_t$ using modulated deformable operations guided by the initial heatmap $\hat{H}_t$. Joint-motion mutual learning further refines the local joint feature $J_t$ and global motion flow $M_t$ using their knowledge to complement each other. An information orthogonality objective $\mathcal{L}_{IO}$ is adopted to improve the diversity of the learned $J_t$ and $M_t$, which is conditioned on the initial heatmap $\hat{H}_t$. The final $L^{th}$ representation is aggregated and fed to the detection head to obtain the final heatmap $\hat{\mathcal{H}}_t$ for pose estimation. Finally, we employ a heatmap loss $\mathcal{L}_H$ to measure the discrepancy between the ground truth $\mathcal{H}_t$ and the detected heatmap $\hat{\mathcal{H}}_t$.
Figure 3: The joint-motion mutual learning framework. Left: The architecture of the $i^{th}$ joint-motion mutual learning layer, with legend. Right: We propose an information orthogonality objective to update the parameters of joint-motion mutual learning and mine diverse local joint features and global motion flow.
Figure 4: The keyframe (a) and visual comparisons of detection results obtained from DCPose (b), TDMI (c), and our JM-Pose (d) on challenging scenes in the PoseTrack dataset. Inaccurate predictions are highlighted with solid red circles.
Figure 5: More visual results of JM-Pose on benchmark datasets. Challenging cases such as high-speed motion, video defocus, and pose occlusion are involved.
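The Figure 2 caption describes a heatmap loss $\mathcal{L}_H$ that measures the discrepancy between the ground-truth and detected heatmaps, but this section does not state its exact form. As a minimal sketch, assuming the common choices of a Gaussian ground-truth heatmap, a mean squared error loss, and argmax decoding (all three are assumptions here, and the function names are hypothetical), the pipeline for a single joint could look like this:

```python
import math

def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    """Ground-truth heatmap for one joint: a 2D Gaussian peaked at column cx, row cy."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def heatmap_mse(predicted, target):
    """Assumed form of the heatmap loss: mean squared error over all pixels."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(predicted, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count

def decode_joint(heatmap):
    """Return (x, y) of the highest-scoring pixel as the estimated joint location."""
    best_value, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy

if __name__ == "__main__":
    target = gaussian_heatmap(5, 5, cx=2, cy=3)
    predicted = gaussian_heatmap(5, 5, cx=1, cy=3)  # prediction off by one pixel
    print(decode_joint(target), decode_joint(predicted))
    print(heatmap_mse(predicted, target))
```

A prediction identical to the target gives a loss of zero, and the loss grows as the predicted peak drifts away from the annotated joint location.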
