AdvMT: Adversarial Motion Transformer for Long-term Human Motion Prediction

Sarmad Idrees; Jongeun Choi; Seokman Sohn

TL;DR

AdvMT tackles long-term human motion prediction for safe human-robot interaction by coupling a Transformer-encoder-based motion model with a temporal continuity discriminator, capturing spatial and temporal dependencies simultaneously and reducing artifacts through adversarial feedback. The approach uses an auto-regressive training regime and a composite loss $\mathcal{L} = \mathcal{L}_{MPJPE} + \lambda_B \mathcal{L}_{bone} + \lambda_D \mathcal{L}_{D_K}$, where $\mathcal{L}_{MPJPE} = \frac{1}{N(T+L)} \sum_{t=T+1}^{T+L} \sum_{n=1}^{N} \| \hat{x}_{t,n} - x_{t,n} \|^2$. The temporal discriminator loss $\mathcal{L}_{D_K}$ enforces plausible joint velocity changes via $\mathcal{L}_{D_K} = \sum_{t=T+1}^{T+L} \left( \mathbb{E}_{x_t} \left[ \| D_K(\Delta x_t) \|^2 \right] + \mathbb{E}_{\hat{x}_t} \left[ \| 1 - D_K(\Delta \hat{x}_t) \|^2 \right] \right)$, mitigating zero-velocity collapse and reducing error accumulation. Experiments on Human3.6M show that AdvMT improves long-term accuracy while maintaining strong short-term performance, highlighting its potential for real-time, safe human-robot interaction.
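To make the composite loss in the TL;DR concrete, here is a minimal pure-Python sketch of how its three terms might be evaluated on a single sequence. It is an illustration under stated assumptions, not the authors' implementation. It reads $T$ as the number of observed frames, $L$ as the number of predicted frames, and $N$ as the number of joints. The exact form of $\mathcal{L}_{bone}$ is not given here, so `bone_term` uses an assumed mean absolute bone-length difference. $\Delta x_t$ is assumed to be the frame-to-frame joint displacement, and `d_k` is a hypothetical callable standing in for the discriminator $D_K$. All function names are invented for this example.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two joint positions."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mpjpe_term(pred, target, num_observed):
    """L_MPJPE as written in the TL;DR: squared joint errors summed over
    the L predicted frames and N joints, divided by N * (T + L)."""
    num_future = len(pred)      # L
    num_joints = len(pred[0])   # N
    total = sum(
        sq_dist(p, g)
        for pred_frame, true_frame in zip(pred, target)
        for p, g in zip(pred_frame, true_frame)
    )
    return total / (num_joints * (num_observed + num_future))

def bone_term(pred, target, bones):
    """ASSUMED form of L_bone: mean absolute difference between predicted
    and ground-truth bone lengths. `bones` is a list of (joint, joint) pairs."""
    total = 0.0
    for pred_frame, true_frame in zip(pred, target):
        for i, j in bones:
            pred_len = math.dist(pred_frame[i], pred_frame[j])
            true_len = math.dist(true_frame[i], true_frame[j])
            total += abs(pred_len - true_len)
    return total / (len(pred) * len(bones))

def velocity(frames):
    """ASSUMED meaning of Delta x_t: per-joint displacement between
    consecutive frames. Returns len(frames) - 1 entries, so prepend the
    last observed frame to cover t = T + 1."""
    return [
        [[c1 - c0 for c0, c1 in zip(j0, j1)] for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]

def discriminator_term(real_vel, fake_vel, d_k):
    """L_{D_K} as written in the TL;DR, with each expectation replaced by
    a single-sample estimate. `d_k` maps a velocity frame to a scalar."""
    return sum(
        d_k(r) ** 2 + (1.0 - d_k(f)) ** 2
        for r, f in zip(real_vel, fake_vel)
    )

def composite_loss(l_mpjpe, l_bone, l_disc, lambda_b, lambda_d):
    """L = L_MPJPE + lambda_B * L_bone + lambda_D * L_{D_K}."""
    return l_mpjpe + lambda_b * l_bone + lambda_d * l_disc
```

In a real training loop these terms would be computed on batched tensors in an autodiff framework. The nested lists here only keep the arithmetic visible.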

Abstract

To achieve seamless collaboration between robots and humans in a shared environment, accurately predicting future human movements is essential. Human motion prediction has traditionally been approached as a sequence prediction problem, leveraging historical human motion data to estimate future poses. Beginning with vanilla recurrent networks, the research community has investigated a variety of methods for learning human motion dynamics, encompassing graph-based and generative approaches. Despite these efforts, achieving accurate long-term predictions continues to be a significant challenge. In this regard, we present the Adversarial Motion Transformer (AdvMT), a novel model that integrates a transformer-based motion encoder and a temporal continuity discriminator. This combination effectively captures spatial and temporal dependencies simultaneously within frames. With adversarial training, our method effectively reduces the unwanted artifacts in predictions, thereby ensuring the learning of more realistic and fluid human motions. The evaluation results indicate that AdvMT greatly enhances the accuracy of long-term predictions while also delivering robust short-term predictions.

Paper Structure (15 sections, 3 equations, 4 figures, 3 tables)

This paper contains 15 sections, 3 equations, 4 figures, 3 tables.

Introduction
Related work
Long-term human motion prediction
Adversarial training
Transformer network
Adversarial Motion Transformer (AdvMT)
Problem formulation
Motion encoder branch
Temporal continuity discriminator
Loss function
Experiments
Ablation study
Architecture
Loss function
Conclusion

Figures (4)

Figure 1: Left: Overview of our proposed AdvMT network, which predicts future human motion from observed past motion. Right: The human body joint link structure, consisting of five body parts: the torso and head, left leg, right leg, left arm, and right arm.
Figure 2: The architecture of our proposed human motion prediction method comprises two main branches: the motion encoder branch and the temporal continuity discriminator. The motion encoder branch, which employs a Transformer encoder layer, is dedicated to learning human motion dynamics, whereas temporal consistency in motion prediction is achieved through our tailored loss function. The bone length error enables the model to maintain consistent bone lengths and adhere to human body constraints over extended periods. Additionally, the discriminator further refines the predicted poses by concentrating on the temporal differences in joint positions. We iteratively use previous predictions as input to forecast future motion, which is particularly effective for long-horizon predictions.
Figure 3: The detailed architecture of our Transformer-based motion encoder branch. The local and global dependencies within the human body are extracted through multiple layers of attention blocks. Each block aims to learn different aspects of motion dynamics, enabling a comprehensive understanding of human movement.
Figure 4: Qualitative future motion prediction results up to 2 seconds for the walking, eating, phoning, and walking together actions from the H3.6M dataset. For visualization purposes, the predictions are down-sampled to 5 frames per second. Ground truth poses are drawn in purple and green, whereas the future predictions are drawn in blue and red. Best viewed zoomed in.
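
The Figure 2 caption says that previous predictions are fed back as input to forecast further into the future. A minimal sketch of that auto-regressive rollout follows. It assumes a fixed-length sliding window and a model that emits one frame per call, neither of which is stated in the text above. `predict_next` is a hypothetical stand-in for the trained motion encoder, and `constant_velocity` is a toy predictor used only to exercise the loop.

```python
def rollout(predict_next, history, num_future):
    """Auto-regressive forecasting: each predicted frame is appended to
    the input window and the oldest frame is dropped (assumed scheme)."""
    window = list(history)
    predictions = []
    for _ in range(num_future):
        next_frame = predict_next(window)
        predictions.append(next_frame)
        # Slide the window so the model now conditions on its own output.
        window = window[1:] + [next_frame]
    return predictions

def constant_velocity(window):
    """Toy predictor: extrapolate the last frame-to-frame change.
    Frames are plain floats here purely for illustration."""
    return 2 * window[-1] - window[-2]

future = rollout(constant_velocity, [0.0, 1.0], 3)
print(future)  # [2.0, 3.0, 4.0]
```

Because every step consumes earlier outputs, any per-step error feeds into later steps. This is the error accumulation that the TL;DR says the discriminator term is meant to reduce.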
