Human Motion Prediction, Reconstruction, and Generation

Canxuan Gang, Yiran Wang

TL;DR

This survey consolidates advances across three intertwined strands of human motion research: prediction, reconstruction, and generation. It highlights how transformers, diffusion models, and physics-informed priors address challenges such as occlusion, long-term consistency, and fine-grained hand-object interactions. It covers prediction methods that handle perturbations and scene constraints, reconstruction approaches that achieve real-time 3D mesh tracking under noise and occlusion, and generation techniques that synthesize diverse, controllable motions from action labels or text prompts. Key contributions covered include diffusion-based pipelines for robust motion reconstruction, latent diffusion and discrete-token frameworks for text-to-motion, and long-sequence generation strategies that enable coherent action transitions. Collectively, the work underlines the importance of physics, scene context, and multimodal conditioning for realistic, practically useful motion synthesis in robotics, AR/VR, gaming, and animation.

Abstract

This report reviews recent advancements in human motion prediction, reconstruction, and generation. Human motion prediction focuses on forecasting future poses and movements from historical data, addressing challenges like nonlinear dynamics, occlusions, and motion style variations. Reconstruction aims to recover accurate 3D human body movements from visual inputs, often leveraging transformer-based architectures, diffusion models, and physical consistency losses to handle noise and complex poses. Motion generation synthesizes realistic and diverse motions from action labels, textual descriptions, or environmental constraints, with applications in robotics, gaming, and virtual avatars. Additionally, text-to-motion generation and human-object interaction modeling have gained attention, enabling fine-grained and context-aware motion synthesis for augmented reality and robotics. This review highlights key methodologies, datasets, challenges, and future research directions driving progress in these fields.

Paper Structure

This paper contains 31 sections and 25 figures.

Figures (25)

  • Figure 1: Overview of the proposed method: The model generates diverse and smooth motion predictions by leveraging a normalizing flow-based pose prior and a structured generation process.
  • Figure 2: Overview of the Latent Differentiable Physics (LDP) model. The model maps full-body poses to the Inverted Pendulum Model (IPM) state, simulates the interaction forces, and reconstructs the full-body motion, enabling effective prediction under physical perturbations.
  • Figure 3: Overview of the DiMoP3D architecture. The system incorporates a Context-aware Intermodal Interpreter to analyze 3D scenes, a Behaviorally-consistent Stochastic Planner for planning motion trajectories, and a Self-prompted Motion Generator to produce diverse and physically consistent motion sequences.
  • Figure 4: Overview of the contact-aware human motion forecasting approach. The system predicts future contact maps (left) and uses these maps to forecast future human poses (right), ensuring consistency between the global motion and local poses.
  • Figure 5: Overview of the 4DHumans system. Left: HMR 2.0 for human mesh recovery; Right: 4DHumans for joint reconstruction and tracking in video.
  • ...and 20 more figures