H-MoRe: Learning Human-centric Motion Representation for Action Analysis
Zhanbo Huang, Xiaoming Liu, Yu Kong
TL;DR
H-MoRe extracts precise human-centric motion by introducing world-local flows, which capture both absolute (world) and relative (local) motion. The method is self-supervised: a joint constraint framework combines a skeleton term, guided by pose-derived skeleton offsets, with a boundary term that enforces boundary alignment, and a lightweight network efficiently derives the local flow from the world flow. H-MoRe shows substantial gains in gait recognition, action recognition, and video generation. It runs in real time, is robust to background motion and overlap, and delineates boundaries more sharply through boundary-aware edges and patch-centroid Chamfer approximations. The approach offers a practical, scalable way to bring accurate motion and body-shape information to diverse human-centric tasks. It is currently limited to 2D, single-scene settings; extending it to 3D and multi-subject scenarios is left to future work.
Abstract
In this paper, we propose H-MoRe, a novel pipeline for learning precise human-centric motion representation. Our approach dynamically preserves relevant human motion while filtering out background movement. Notably, unlike previous methods that rely on fully supervised learning from synthetic data, H-MoRe learns directly from real-world scenarios in a self-supervised manner, incorporating both human pose and body shape information. Inspired by kinematics, H-MoRe represents the absolute and relative movements of each body point in a matrix format, termed world-local flows, that captures nuanced motion details. H-MoRe offers refined insights into human motion and can be integrated seamlessly into various action-related applications. Experimental results demonstrate that H-MoRe brings substantial improvements across various downstream tasks, including gait recognition (CL@R1: +16.01%), action recognition (Acc@1: +8.92%), and video generation (FVD: -67.07%). Additionally, H-MoRe exhibits high inference efficiency (34 fps), making it suitable for most real-time scenarios. Models and code will be released upon publication.
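To make the kinematic intuition behind world-local flows concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a local (relative) flow can be approximated by subtracting the mean motion of a reference body region, such as the torso, from the world (absolute) flow. H-MoRe itself derives the local flow with a learned lightweight network; the function names `mean_flow` and `local_from_world`, the mask inputs, and the choice of reference region are all hypothetical.

```python
from typing import List, Tuple

Vec = Tuple[float, float]   # one flow vector (u, v) in pixels per frame
Flow = List[List[Vec]]      # H x W grid of flow vectors
Mask = List[List[bool]]     # H x W boolean mask


def mean_flow(flow: Flow, mask: Mask) -> Vec:
    """Average (u, v) over the pixels selected by the mask."""
    total_u = total_v = 0.0
    count = 0
    for flow_row, mask_row in zip(flow, mask):
        for (u, v), selected in zip(flow_row, mask_row):
            if selected:
                total_u += u
                total_v += v
                count += 1
    if count == 0:
        return (0.0, 0.0)
    return (total_u / count, total_v / count)


def local_from_world(world: Flow, body: Mask, reference: Mask) -> Flow:
    """Approximate relative motion (assumption, not the paper's network).

    Subtracts the reference region's mean motion from every body pixel
    and zeroes every non-body pixel, which discards background movement.
    """
    ref_u, ref_v = mean_flow(world, reference)
    local: Flow = []
    for flow_row, body_row in zip(world, body):
        local.append([
            (u - ref_u, v - ref_v) if on_body else (0.0, 0.0)
            for (u, v), on_body in zip(flow_row, body_row)
        ])
    return local


# Toy 1 x 3 frame: a background pixel, a torso pixel, and a hand pixel.
world = [[(0.0, 0.0), (2.0, 0.0), (5.0, 1.0)]]
body = [[False, True, True]]
reference = [[False, True, False]]  # the torso is the reference region

print(local_from_world(world, body, reference))
# prints [[(0.0, 0.0), (0.0, 0.0), (3.0, 1.0)]]
```

In this toy frame the torso moves 2 pixels to the right in world coordinates, so the hand's world motion of (5, 1) becomes a local motion of (3, 1) relative to the body, while the background pixel is zeroed out.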
