Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis

Zimo Li, Yi Zhou, Shuangjiu Xiao, Chong He, Zeng Huang, Hao Li

TL;DR

The paper tackles long-horizon, realistic human motion synthesis by addressing the error accumulation that affects autoregressive models. It introduces the auto-conditioned RNN (acRNN), a training regime in which the network is fed its own outputs as inputs for fixed-length intervals, so that training explicitly accounts for accumulated autoregressive error. This enables sustained generation of diverse motions such as dances and martial arts. In quantitative and qualitative evaluations on CMU datasets, acRNN shows markedly improved long-term stability, generating hundreds of seconds of coherent motion without permanent divergence or freezing and outperforming prior RNN-based approaches. The approach has practical implications for real-time animation and VR, since it produces stylistically varied motion without relying on extensive databases or hand-crafted priors.
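
The alternation between ground-truth inputs and self-fed inputs can be illustrated with a minimal sketch. It is not the authors' implementation: the names `uses_ground_truth`, `unroll`, `gt_len`, and `cond_len` are hypothetical, `step` stands in for any recurrent cell, and the sketch assumes each cycle starts with ground-truth frames followed by self-conditioned frames.

```python
from typing import Any, Callable, List, Sequence, Tuple


def uses_ground_truth(t: int, gt_len: int, cond_len: int) -> bool:
    """Return True if step t is fed a ground-truth frame.

    Assumed schedule: gt_len ground-truth steps, then cond_len steps fed
    with the network's own previous output, repeating.
    """
    if gt_len < 1 or cond_len < 0:
        raise ValueError("gt_len must be >= 1 and cond_len must be >= 0")
    return t % (gt_len + cond_len) < gt_len


def unroll(
    step: Callable[[Any, Any], Tuple[Any, Any]],
    frames: Sequence[Any],
    gt_len: int,
    cond_len: int,
) -> List[Any]:
    """Unroll a recurrent `step(input, state) -> (output, state)` over frames.

    On self-conditioned steps the ground-truth frame is ignored and the
    previous output is used as the input instead.
    """
    outputs: List[Any] = []
    state = None
    prev_out = None
    for t, frame in enumerate(frames):
        x = frame if uses_ground_truth(t, gt_len, cond_len) else prev_out
        prev_out, state = step(x, state)
        outputs.append(prev_out)
    return outputs


if __name__ == "__main__":
    # Toy "network": adds 1 to its input and keeps no state.
    toy_step = lambda x, state: (x + 1, state)
    print([uses_ground_truth(t, 4, 4) for t in range(10)])
    print(unroll(toy_step, [0, 10, 20, 30, 40, 50], gt_len=2, cond_len=2))
```

In a real training loop the loss would still compare every output against the corresponding ground-truth next frame; only the inputs change on self-conditioned steps.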

Abstract

We present a real-time method for synthesizing highly complex human motions using a novel training regime we call the auto-conditioned Recurrent Neural Network (acRNN). Recently, researchers have attempted to synthesize new motion by using autoregressive techniques, but existing methods tend to freeze or diverge after a couple of seconds due to an accumulation of errors that are fed back into the network. Furthermore, such methods have only been shown to be reliable for relatively simple human motions, such as walking or running. In contrast, our approach can synthesize arbitrary motions with highly complex styles, including dances or martial arts in addition to locomotion. The acRNN is able to accomplish this by explicitly accommodating for autoregressive noise accumulation during training. Our work is the first to our knowledge that demonstrates the ability to generate over 18,000 continuous frames (300 seconds) of new complex human motion w.r.t. different styles.

Paper Structure

This paper contains 16 sections, 2 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Visual diagram of an unrolled Auto-Conditioned RNN (right) with condition length $v=4$ and ground truth length $u=4$. $I_t$ is the input at time step $t$. $S_t$ is the hidden state. $O_t$ is the output.
  • Figure 2: Motion change between subsequent frames for different motion styles, measured as the Euclidean distance between consecutive predicted frames and shown at different frame indices. All acLSTM networks here are trained with condition length 5. Predictions are generated from 10 frames (approximately 170 ms) of seed motion from the test set, and results are averaged over 20 random seed motions. A low motion-change value indicates that the motion has frozen. Note that the acLSTM and the vanilla LSTM have exactly the same architecture; differences are due solely to training.
  • Figure 3: Comparison between the vanilla LSTM and our method at 250,000 iterations of training. Top: vanilla LSTM. Bottom: acLSTM. The two synthesized motions are initialized with the same 10 frames of ground truth motion. The motion generated by the vanilla LSTM freezes after around 60 frames. Our method does not freeze.
  • Figure 4: Motion sequences generated by acLSTM, sampled at various frames. Motion style from top to bottom: martial arts, Indian dancing, Indian/salsa hybrid, and walking. All motions are generated at 60 fps and are initialized with 10 frames of ground truth data randomly picked from the database. The number at the bottom of each image is the frame index. The images are rendered with BVHViewer 1.1 [bvh_site].
  • Figure 5: Sample frames from a 300+ second generated sequence. Note that no sequence in the training set exceeds 30 seconds of contiguous motion.
  • ...and 5 more figures
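
The motion-change measure described in the Figure 2 caption (Euclidean distance between subsequent predicted frames, with low values indicating freezing) can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: each frame is assumed to be a flat vector of pose coordinates, and the function names, the `threshold`, and the `window` parameter are hypothetical choices for the example.

```python
import math
from typing import List, Optional, Sequence


def motion_change(frames: Sequence[Sequence[float]]) -> List[float]:
    """Euclidean distance between each pair of consecutive frames."""
    return [math.dist(a, b) for a, b in zip(frames, frames[1:])]


def first_frozen_index(
    changes: Sequence[float], threshold: float, window: int
) -> Optional[int]:
    """Index where the first run of `window` consecutive changes below
    `threshold` begins, or None if no such run exists.

    The threshold and window are illustrative; the source gives no values.
    """
    run = 0
    for i, change in enumerate(changes):
        run = run + 1 if change < threshold else 0
        if run == window:
            return i - window + 1
    return None


if __name__ == "__main__":
    # Toy 2-D "poses": one large move, then no movement at all.
    frames = [(0.0, 0.0), (3.0, 4.0), (3.0, 4.0), (3.0, 4.0)]
    changes = motion_change(frames)
    print(changes)
    print(first_frozen_index(changes, threshold=1e-6, window=2))
```

Averaging `motion_change` element-wise over several generated sequences would correspond to the averaging over seed motions mentioned in the caption.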