H-MoRe: Learning Human-centric Motion Representation for Action Analysis

Zhanbo Huang, Xiaoming Liu, Yu Kong

TL;DR

H-MoRe addresses the challenge of extracting precise human-centric motion by introducing world-local flows that capture both absolute (world) and relative (local) motion. The method is trained in a self-supervised manner with a joint constraint framework: a skeleton term guided by pose-derived skeleton offsets and a boundary term based on boundary alignment. Local flow is derived efficiently from world flow by a lightweight network. H-MoRe shows substantial gains on gait recognition, action recognition, and video generation, runs in real time, is robust to background and overlap, and delineates boundaries more sharply through boundary-aware edges and patch-centroid Chamfer approximations. The approach offers a practical, scalable way to exploit accurate motion and body-shape information in diverse human-centric tasks. It is currently limited to 2D, single-scene settings; 3D extension and multi-subject scenarios are left to future work.

Abstract

In this paper, we propose H-MoRe, a novel pipeline for learning precise human-centric motion representation. Our approach dynamically preserves relevant human motion while filtering out background movement. Notably, unlike previous methods relying on fully supervised learning from synthetic data, H-MoRe learns directly from real-world scenarios in a self-supervised manner, incorporating both human pose and body shape information. Inspired by kinematics, H-MoRe represents absolute and relative movements of each body point in a matrix format that captures nuanced motion details, termed world-local flows. H-MoRe offers refined insights into human motion, which can be integrated seamlessly into various action-related applications. Experimental results demonstrate that H-MoRe brings substantial improvements across various downstream tasks, including gait recognition (CL@R1: +16.01%), action recognition (Acc@1: +8.92%), and video generation (FVD: -67.07%). Additionally, H-MoRe exhibits high inference efficiency (34 fps), making it suitable for most real-time scenarios. Models and code will be released upon publication.
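
The kinematic idea behind world-local flows can be illustrated with plain vector arithmetic: world motion is local motion composed with the subject's overall motion. The sketch below is only an illustration under stated assumptions. It assumes flows are stored as `H x W x 2` arrays, that the subject's overall motion is given as a single 2D vector `v_s`, and that a binary subject mask is available; the function names `world_to_local` and `local_to_world` are hypothetical. H-MoRe itself derives local flow from world flow with a learned lightweight network, not with this closed-form subtraction.

```python
import numpy as np

def world_to_local(world_flow, v_s, mask):
    """Decompose world flow into local flow (M_l = M_w - v_s) inside the subject mask.

    Assumption: v_s is one 2D vector describing the subject's overall motion.
    """
    v_s = np.asarray(v_s, dtype=world_flow.dtype)
    local = world_flow - v_s  # v_s broadcasts over the H and W axes
    return np.where(mask[..., None], local, 0.0)  # background stays zero

def local_to_world(local_flow, v_s, mask):
    """Compose local flow with the subject motion (M_w = M_l + v_s) inside the mask."""
    v_s = np.asarray(v_s, dtype=local_flow.dtype)
    return np.where(mask[..., None], local_flow + v_s, 0.0)

# Toy example: a 2x2 subject translating right by 2 px per frame,
# with one body point that additionally moves by (1, 1).
H, W = 4, 4
mask = np.zeros((H, W), dtype=bool)
mask[1:3, 1:3] = True

world = np.zeros((H, W, 2), dtype=np.float32)
world[mask] = [2.0, 0.0]   # every subject pixel carries the overall motion
world[1, 1] = [3.0, 1.0]   # one point also moves relative to the body

v_s = [2.0, 0.0]
local = world_to_local(world, v_s, mask)
recovered = local_to_world(local, v_s, mask)
```

In this toy case the local flow is zero everywhere on the subject except at the one point that moves relative to the body, which is the kind of separation the world/local split is meant to expose.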

Paper Structure

This paper contains 31 sections, 14 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Comparison of our H-MoRe with other motion representations. We visualize three pose-related representations -- 2D Pose [fang2022alphapose], 3D Pose [sarandi2023dozens], and PoseFlow [zhang2018poseflow] -- as well as flow-related representations: Optical Flow [teed2020raft] and our H-MoRe (which contains world flow and local flow). The red box highlights H-MoRe's precise motion information and sharp boundaries.
  • Figure 2: Whole Pipeline of H-MoRe. From left to right: (a) The training and inference pipeline for world-local flows (outlined by a gray dashed line) and the use of the joint constraints learning framework for self-supervised learning from real-world scenarios. Blue symbols and lines denote the boundary constraint $\mathcal{G}$, while green symbols and lines indicate the skeleton constraint $\mathcal{F}$. Backpropagation gradients are shown as dashed lines in corresponding colors; (b) the internal implementation of $\Phi$; and (c) the internal implementation of $\Psi$.
  • Figure 3: Definition of world-local flows. World motion $\boldsymbol{M_w}$ is movement relative to the environment, while local motion $\boldsymbol{M_l}$ is relative to the subject. Using the subject's overall motion $\boldsymbol{v_s}$, these can be converted via vector composition and decomposition.
  • Figure 4: Overview of joint constraint learning framework. We introduce two constraints -- (1) the boundary constraint $\mathcal{G}$ (blue), which aligns the human boundary with the flow edges to maintain consistent shapes; and (2) the skeleton constraint $\mathcal{F}$, which uses skeleton offsets to regulate body point movements, ensuring consistent motion, as reflected by matching colors in the visualization.
  • Figure 5: Details of our joint constraint. (a) matching any body point $p$ to its corresponding skeleton point $\hat{q}$; (b) an angular constraint to align estimated motion with skeleton's movement directions; (c) an intensity constraint to ensure consistent motion magnitude between estimated motion and skeleton offsets; (d) matching any point $i$ on the flow edges $\boldsymbol{s}$ to its corresponding $\hat{j}$ on human boundaries $\boldsymbol{e}$; and (e) calculation of our patch-centroid distance.
  • ...and 9 more figures
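
The skeleton-side steps named in the Figure 5 caption (matching a body point to a skeleton point, an angular constraint, and an intensity constraint) can be sketched roughly as follows. The captions do not give the exact formulas, so everything concrete here is an assumption made for illustration: nearest-neighbour matching, `1 - cosine similarity` as the angular penalty, and the absolute difference of magnitudes as the intensity penalty. The function names are hypothetical and this is not the paper's loss.

```python
import numpy as np

def match_to_skeleton(points, skeleton_points):
    """Return, for each body point p, the index of its nearest skeleton point.

    Assumption: nearest-neighbour matching in image coordinates.
    """
    diff = points[:, None, :] - skeleton_points[None, :, :]
    return np.argmin(np.linalg.norm(diff, axis=-1), axis=1)

def skeleton_penalties(flow_vecs, skel_offsets, eps=1e-8):
    """Mean angular and intensity penalties between flow and skeleton offsets.

    Assumption: angular = 1 - cos(angle), intensity = | ||flow|| - ||offset|| |.
    """
    dot = np.sum(flow_vecs * skel_offsets, axis=-1)
    n_flow = np.linalg.norm(flow_vecs, axis=-1)
    n_skel = np.linalg.norm(skel_offsets, axis=-1)
    angular = 1.0 - dot / (n_flow * n_skel + eps)  # 0 when directions agree
    intensity = np.abs(n_flow - n_skel)            # 0 when magnitudes agree
    return angular.mean(), intensity.mean()

# Toy data: two body points, two skeleton points with known offsets.
points = np.array([[0.0, 0.0], [10.0, 0.0]])
skeleton_points = np.array([[1.0, 0.0], [9.0, 1.0]])
skeleton_offsets = np.array([[2.0, 0.0], [0.0, 3.0]])
flow_at_points = np.array([[2.0, 0.0], [3.0, 0.0]])  # estimated motion at each body point

idx = match_to_skeleton(points, skeleton_points)
angular, intensity = skeleton_penalties(flow_at_points, skeleton_offsets[idx])
```

Here the first point agrees with its skeleton offset in both direction and magnitude, while the second has the right magnitude but a perpendicular direction, so only the angular term is penalised. Keeping the two terms separate makes it possible to weight direction and magnitude errors independently.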