Orientation-Aware Leg Movement Learning for Action-Driven Human Motion Prediction

Chunzhi Gu, Chao Zhang, Shigeru Kuriyama

TL;DR

Problem: Action-driven human motion prediction must transition between actions in a way that respects orientation changes, yet most datasets contain no transition data. The authors propose a two-stage framework: a diffusion-based Motion Diffusion Model first generates a target motion conditioned on the future action, and a CVAE called AinB-VAE, augmented with an orientation-warping module, then performs action-conditioned in-betweening. Key contributions include modeling transitions as in-betweening over a small set of gait-focused actions, an orientation-aware decoding mechanism via cross-attention, and a diversity sampler that captures inter- and intra-class variability, all without relying on ground-truth transition labels. Experiments on BABEL, HumanAct12, and NTU RGB-D show state-of-the-art perceptual quality, action faithfulness, and broad generalization, with robust performance even under dataset noise. The approach reduces the need for annotated transitions and offers realistic, diverse motion predictions suitable for robotics, animation, and perception tasks.

Abstract

The task of action-driven human motion prediction aims to forecast future human motion based on the observed sequence while respecting the given action label. It requires modeling not only the stochasticity within human motion but also the smooth yet realistic transition between multiple action labels. However, most datasets do not contain such transition data, which complicates this task. Existing work tackles this issue by learning a smoothness prior that simply promotes smooth transitions, yet doing so can result in unnatural transitions, especially when the history and predicted motions differ significantly in orientation. In this paper, we argue that valid human motion transitions should incorporate realistic leg movements to handle orientation changes, and we cast transition generation as an action-conditioned in-betweening (ACB) learning task to encourage transition naturalness. Because modeling all possible transitions is practically infeasible, our ACB is performed only on a few selected action classes with active gait motions, such as Walk or Run. Specifically, we follow a two-stage forecasting strategy: we first employ the motion diffusion model to generate the target motion with a specified future action, and then produce the in-betweening that smoothly connects the observation and the prediction. Our method requires no labeled motion transition data during training. To show the robustness of our approach, we apply the in-betweening model trained on one dataset to two unseen large-scale motion datasets to produce natural transitions. Extensive experimental evaluations on three benchmark datasets demonstrate that our method achieves state-of-the-art performance in terms of visual quality, prediction accuracy, and action faithfulness.
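
The two-stage strategy described in the abstract can be sketched as follows. This is only a structural illustration under stated assumptions, not the authors' implementation: the function names, the flat pose representation, and the default frame counts are hypothetical. Both stages are replaced by trivial stand-ins: a random walk where the paper uses a motion diffusion model, and linear blending where the paper uses the learned AinB-VAE. Linear blending in particular is the kind of smoothness-only transition the paper argues against, so it serves here purely as a placeholder that makes the control flow runnable.

```python
import random
from typing import List

Pose = List[float]    # one frame: a flat list of joint coordinates (toy representation)
Motion = List[Pose]   # a sequence of frames


def generate_target_motion(action: str, num_frames: int, pose_dim: int, seed: int = 0) -> Motion:
    """Stage 1 stand-in. The paper uses a motion diffusion model conditioned on
    the future action; this placeholder emits a small random walk instead."""
    rng = random.Random(f"{action}-{seed}")
    pose = [rng.uniform(-1.0, 1.0) for _ in range(pose_dim)]
    motion = []
    for _ in range(num_frames):
        pose = [p + rng.gauss(0.0, 0.05) for p in pose]
        motion.append(pose)
    return motion


def in_between(start: Pose, end: Pose, transition_action: str, num_frames: int) -> Motion:
    """Stage 2 stand-in. The paper uses a learned, action-conditioned model
    (AinB-VAE); this placeholder blends linearly and ignores transition_action."""
    frames = []
    for i in range(1, num_frames + 1):
        t = i / (num_frames + 1)  # strictly between 0 and 1
        frames.append([(1 - t) * s + t * e for s, e in zip(start, end)])
    return frames


def predict(observed: Motion, future_action: str, transition_action: str = "Walk",
            target_len: int = 20, transition_len: int = 10) -> Motion:
    """Generate the target motion first, then connect it to the observation."""
    pose_dim = len(observed[-1])
    target = generate_target_motion(future_action, target_len, pose_dim)
    transition = in_between(observed[-1], target[0], transition_action, transition_len)
    return transition + target


observed = [[0.0] * 6 for _ in range(5)]
prediction = predict(observed, future_action="Sit")
print(len(prediction))  # transition frames followed by target frames
```

The point of the sketch is the ordering: the target motion is generated before the transition, so the in-betweening step sees both endpoints (the last observed pose and the first target pose) together with the transition action.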

Paper Structure (10 sections, 9 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: An example of the action-driven prediction performance of our method compared against WAT (Mao et al., 2022). Both methods predict the future sequence given the action "Walk". WAT suffers from unnatural transitions (e.g., foot sliding) or poor label faithfulness. Our method yields natural transitions with plausible leg movements and also respects the action label better.
  • Figure 2: Overview of our method. (a) depicts our transition generation framework, AinB-VAE. (b) gives the pipeline of our action-driven human motion prediction, which involves two stages. The two stages can be alternately performed to predict long-term motions with natural transitions in a recursive fashion.
  • Figure 3: Orientation-warping module (left) and the AinB-VAE Decoder (right).
  • Figure 4: Qualitative results of inter-action transition diversity given conditions $(\mathbf{X}_s, \mathbf{X}_e,\mathbf{a}^b)$ as input.
  • Figure 5: Qualitative results of intra-action transition diversity, with the in-betweening action $\mathbf{a}^b$ set to Step and Run as examples. All results are obtained via the diversity sampler.
  • ...and 1 more figure