AdvMT: Adversarial Motion Transformer for Long-term Human Motion Prediction
Sarmad Idrees, Jongeun Choi, Seokman Sohn
TL;DR
AdvMT tackles long-term human motion prediction for safe human-robot interaction by coupling a Transformer encoder-based motion model with a temporal continuity discriminator, enabling simultaneous capture of spatial and temporal dependencies and reducing artifacts via adversarial feedback. The approach uses an auto-regressive training regime and a composite loss $\mathcal{L} = \mathcal{L}_{MPJPE} + \lambda_B \mathcal{L}_{bone} + \lambda_D \mathcal{L}_{D_K}$, where $\mathcal{L}_{MPJPE} = \frac{1}{N(T+L)} \sum_{t=T+1}^{T+L} \sum_{n=1}^{N} \| \hat{x}_{t,n} - x_{t,n} \|^2$. The temporal discriminator loss $\mathcal{L}_{D_K}$ enforces plausible joint velocity changes via $\mathcal{L}_{D_K} = \sum_{t=T+1}^{T+L} \left( \mathbb{E}_{x_t} \left[ \| D_K(\Delta x_t) \|^2 \right] + \mathbb{E}_{\hat{x}_t} \left[ \| 1 - D_K(\Delta \hat{x}_t) \|^2 \right] \right)$, mitigating zero-velocity collapse and reducing error accumulation. Experiments on Human3.6M show AdvMT yields improved long-term accuracy while maintaining strong short-term performance, highlighting its potential for real-time, safe human-robot interaction.
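The composite loss above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the tensor layout (batch, L predicted frames, N joints, 3 coordinates), the definition $\Delta x_t = x_t - x_{t-1}$, the form of the bone term (mean absolute difference in bone lengths), the placeholder weights `lambda_B` and `lambda_D`, and the hypothetical `toy_discriminator` standing in for $D_K$. Expectations are approximated by batch means.

```python
import numpy as np

def mpjpe_loss(pred, target, T):
    """Squared joint error summed over L frames and N joints, scaled by 1/(N(T+L))."""
    _, L, N, _ = pred.shape
    sq_err = np.sum((pred - target) ** 2, axis=-1)            # (B, L, N)
    return float(np.mean(sq_err.sum(axis=(1, 2)) / (N * (T + L))))

def bone_loss(pred, target, bones):
    """ASSUMED form: mean absolute difference in bone lengths.
    bones is a hypothetical (K, 2) array of (child, parent) joint indices."""
    def lengths(x):
        return np.linalg.norm(x[..., bones[:, 0], :] - x[..., bones[:, 1], :], axis=-1)
    return float(np.mean(np.abs(lengths(pred) - lengths(target))))

def velocity(x, last_obs):
    """ASSUMED definition: frame-to-frame difference, seeded with the last observed pose."""
    seq = np.concatenate([last_obs[:, None], x], axis=1)       # (B, L+1, N, 3)
    return np.diff(seq, axis=1)                                # (B, L, N, 3)

def toy_discriminator(delta):
    """Hypothetical stand-in for D_K: sigmoid of the mean joint speed, shape (B, L)."""
    speed = np.linalg.norm(delta, axis=-1).mean(axis=-1)
    return 1.0 / (1.0 + np.exp(-speed))

def temporal_disc_loss(D, target, pred, last_obs):
    """Sum over frames of E[|D(dx)|^2] + E[|1 - D(dx_hat)|^2], as written in the TL;DR."""
    d_real = D(velocity(target, last_obs))
    d_fake = D(velocity(pred, last_obs))
    per_frame = np.mean(d_real ** 2, axis=0) + np.mean((1.0 - d_fake) ** 2, axis=0)
    return float(per_frame.sum())

def advmt_loss(pred, target, last_obs, bones, T, D, lambda_B=0.1, lambda_D=0.01):
    """Composite loss; lambda_B and lambda_D are placeholder values, not from the paper."""
    return (mpjpe_loss(pred, target, T)
            + lambda_B * bone_loss(pred, target, bones)
            + lambda_D * temporal_disc_loss(D, target, pred, last_obs))
```

With a prediction that simply repeats the last observed pose, `velocity` is zero at every frame, so the discriminator term receives an all-zero input for the predicted sequence; this is the zero-velocity case that the adversarial term is described as discouraging. In an actual training setup $D_K$ would be a learned network and the losses would be written in an autodiff framework.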
Abstract
To achieve seamless collaboration between robots and humans in a shared environment, accurately predicting future human movements is essential. Human motion prediction has traditionally been approached as a sequence prediction problem, leveraging historical human motion data to estimate future poses. Beginning with vanilla recurrent networks, the research community has investigated a variety of methods for learning human motion dynamics, encompassing graph-based and generative approaches. Despite these efforts, achieving accurate long-term predictions continues to be a significant challenge. In this regard, we present the Adversarial Motion Transformer (AdvMT), a novel model that integrates a transformer-based motion encoder and a temporal continuity discriminator. This combination captures spatial and temporal dependencies simultaneously within frames. With adversarial training, our method reduces unwanted artifacts in predictions, thereby ensuring the learning of more realistic and fluid human motions. The evaluation results indicate that AdvMT greatly enhances the accuracy of long-term predictions while also delivering robust short-term predictions.
