Breaking the Passive Learning Trap: An Active Perception Strategy for Human Motion Prediction

Juncheng Hu, Zijian Zhang, Zeyu Wang, Guoyu Wang, Yingji Li, Kedi Lyu

TL;DR

This work tackles 3D human motion prediction, a problem plagued by high-dimensional, stochastic pose histories. It introduces an Active Perception Strategy (APS) with a Data Perception Module that projects poses into a quotient space using tangent vectors $v \in T_p\mathcal{M}$ and Grassmann projections in $Gr(k,d)$, and a Network Perception Module that employs perturbation-based auxiliary supervision within a spatio-temporal transformer trained by a Wasserstein-GP objective. Contributions include a model-agnostic APS framework, the DPM/NPM architecture, and state-of-the-art results on H3.6M, CMU Mocap, and 3DPW (e.g., 16.3%/13.9%/10.1% improvements). The quotient-space decomposition reduces coordinate redundancy and decouples geometry from semantics, while the auxiliary perturbation enforces active recovery of spatio-temporal relations, yielding robust long-horizon predictions across diverse baselines.
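To make the geometric vocabulary above concrete, the following is a minimal sketch and not the authors' Data Perception Module. It assumes a pose sequence stored as a `(T, J, 3)` array, uses finite differences between frames as a stand-in for tangent vectors, and represents a point of $Gr(k,d)$ by the orthogonal projector onto the top-$k$ left singular subspace of the centered pose matrix. The function name `grassmann_projector`, the choice $k=3$, and the random data are all hypothetical. What the sketch does show reliably is the quotient-space property itself: the projector is unchanged when the basis of the subspace is rotated.

```python
import numpy as np

def grassmann_projector(X, k):
    """Return (P, Uk) for a (d, n) data matrix X with n >= k.

    Uk holds the top-k left singular vectors of X, and P = Uk @ Uk.T is the
    orthogonal projector onto their span. P identifies a point on Gr(k, d)
    independently of which orthonormal basis is used for that subspace.
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    Uk = U[:, :k]
    return Uk @ Uk.T, Uk

rng = np.random.default_rng(0)
T, J, k = 10, 22, 3                      # frames, joints, subspace dimension (assumed)
poses = rng.normal(size=(T, J, 3))       # placeholder data, not a real dataset

# Assumption: finite differences between frames approximate tangent vectors.
velocities = np.diff(poses, axis=0)      # shape (T - 1, J, 3)

# Flatten each frame to a d = 3J column and center across time.
X = poses.reshape(T, J * 3).T
X = X - X.mean(axis=1, keepdims=True)

P, Uk = grassmann_projector(X, k)

# Rotate the basis inside the subspace by a random orthogonal matrix Q.
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))
P_rotated = (Uk @ Q) @ (Uk @ Q).T

print(np.allclose(P, P_rotated))         # same subspace, same projector
```

Working with `P` rather than `Uk` is one way to discard the redundant choice of basis, which is the sense in which a Grassmann representation removes coordinate redundancy.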

Abstract

Forecasting 3D human motion is an important embodiment of fine-grained understanding and cognition of human behavior by artificial agents. Current approaches rely excessively on implicit network modeling of spatio-temporal relationships and motion characteristics. They fall into a passive learning trap: the network acquires redundant, monotonous 3D coordinate information and lacks any actively guided, explicit learning mechanism. To overcome these issues, we propose an Active Perception Strategy (APS) for human motion prediction, leveraging quotient space representations to explicitly encode motion properties while introducing auxiliary learning objectives to strengthen spatio-temporal modeling. Specifically, we first design a data perception module that projects poses into the quotient space, decoupling motion geometry from coordinate redundancy. By jointly encoding tangent vectors and Grassmann projections, this module simultaneously achieves geometric dimension reduction, semantic decoupling, and dynamic constraint enforcement for effective motion pose characterization. Furthermore, we introduce a network perception module that actively learns spatio-temporal dependencies through restorative learning. This module deliberately masks specific joints or injects noise to construct auxiliary supervision signals. A dedicated auxiliary learning network is designed to actively adapt to and learn from the perturbed information. Notably, APS is model-agnostic and can be integrated with different prediction models to enhance active perception. Experimental results demonstrate that our method achieves new state-of-the-art results, outperforming existing methods by large margins: 16.3% on H3.6M, 13.9% on CMU Mocap, and 10.1% on 3DPW.
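The restorative-learning idea described in the abstract, masking joints or injecting noise and then supervising recovery of the clean input, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's Network Perception Module: the helper names `perturb_poses` and `restoration_loss`, the masking ratio, the noise scale, and the use of a mean per-joint Euclidean error are all hypothetical choices, and both perturbations are applied together here only for brevity.

```python
import numpy as np

def perturb_poses(poses, mask_ratio=0.2, noise_std=0.01, seed=0):
    """Build a perturbed copy of a (T, J, 3) pose sequence.

    Gaussian noise is added to every coordinate, then a random subset of
    joints is zeroed out across all frames. Returns the perturbed sequence
    and a boolean mask of length J marking the zeroed joints.
    """
    rng = np.random.default_rng(seed)
    num_joints = poses.shape[1]
    num_masked = max(1, int(round(mask_ratio * num_joints)))

    joint_mask = np.zeros(num_joints, dtype=bool)
    joint_mask[rng.choice(num_joints, size=num_masked, replace=False)] = True

    perturbed = poses + rng.normal(0.0, noise_std, size=poses.shape)
    perturbed[:, joint_mask, :] = 0.0    # masked joints carry no information
    return perturbed, joint_mask

def restoration_loss(restored, clean):
    """Mean per-joint Euclidean distance between restored and clean poses."""
    return float(np.linalg.norm(restored - clean, axis=-1).mean())

# Placeholder data: 10 frames, 22 joints (sizes are assumptions).
poses = np.random.default_rng(1).normal(size=(10, 22, 3))
perturbed, joint_mask = perturb_poses(poses)

# An auxiliary network would map `perturbed` back toward `poses`; with no
# network in this sketch, the loss of the untouched perturbed input serves
# as the baseline that such a network would need to beat.
baseline = restoration_loss(perturbed, poses)
print(joint_mask.sum(), baseline > 0.0)
```

In a training loop, the restoration loss would typically be added to the main prediction loss with a weighting factor, so that the auxiliary task shapes the shared representation without replacing the forecasting objective.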

Paper Structure

This paper contains 13 sections, 10 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The upper part highlights the motivation behind APS, while the lower part demonstrates its effectiveness in mitigating premature performance bottlenecks.
  • Figure 2: The architecture of the proposed method.
  • Figure 3: Visual results of our model on H3.6M and CMU datasets.