
Robust Human Motion Forecasting using Transformer-based Model

Esteve Valls Mascaro, Shuo Ma, Hyemin Ahn, Dongheui Lee

TL;DR

The paper tackles robust, real-time 3D human motion forecasting for robotic systems by introducing 2CH-TR, a transformer with decoupled temporal and spatial channels. It processes a short observed prefix (400 ms) and produces a full 1-second future sequence in a single shot, while modeling global rotation and translation. Compared with state-of-the-art baselines, 2CH-TR offers competitive accuracy with reduced computation and model size, and shows strong robustness to occlusion in dedicated reconstruction analyses. The work indicates practical potential for robot–human collaboration in noisy, partially observed environments, and is validated on Human3.6M and in real-world demonstrations.

Abstract

Comprehending human motion is a fundamental challenge for developing Human-Robot Collaborative applications. Computer vision researchers have addressed this field by focusing only on reducing prediction error, without taking into account the requirements for deploying these models on robots. In this paper, we propose a new Transformer-based model that handles real-time 3D human motion forecasting in both the short and the long term. Our 2-Channel Transformer (2CH-TR) efficiently exploits the spatio-temporal information of a briefly observed sequence (400 ms) and achieves accuracy competitive with the current state of the art. 2CH-TR stands out for its efficiency, being lighter and faster than its competitors. In addition, our model is tested in conditions where the human motion is severely occluded, demonstrating its robustness in reconstructing and predicting 3D human motion in a highly noisy environment. Our experimental results show that the proposed 2CH-TR outperforms the ST-Transformer, another state-of-the-art Transformer-based model, in both reconstruction and prediction given the same input prefix. On the Human3.6M dataset with a 400 ms input prefix, our model reduces the mean squared error of ST-Transformer by 8.89% in short-term prediction and by 2.57% in long-term prediction. Webpage: https://evm7.github.io/2CHTR-page/

Paper Structure

This paper contains 15 sections, 4 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: An overview of 3D human motion forecasting in occluded environments. The red lines are the observed 3D skeletons projected onto the image, while the blue lines mark randomly occluded limbs used to test the model's 3D pose-reconstruction capacity. Finally, the green skeletons represent the predicted human pose sequence in the near future.
  • Figure 2: Spatio-temporal graph of joint dependencies for human motion. The blue arrows refer to temporal relationships between the same joint parameters in different frames. The orange arrows imply the spatial relationship between joints in the same frame.
  • Figure 3: Architecture of 2-Channel Transformer (2CH-TR). The observed skeleton motion sequence $X$ is projected independently for each channel into an embedding space ($E_S$ and $E_T$), and positional encoding is then injected. Each embedding is fed into $L$ stacked attention layers that extract dependencies within the sequence using multi-head attention. Finally, each embedding ($\hat{E}_S$ and $\hat{E}_T$) is decoded and projected back to skeleton sequences. Future poses ($\hat{X}_{pred}$) are then obtained by summing the output of each channel ($\hat{X}_S$ and $\hat{X}_T$) with the residual connection $X$ from input to output.
  • Figure 4: Input pattern representation for our 2CH-TR, with $N=10$ poses observed by the model and $T'=25$ future poses to be predicted. In the input, the last $T'$ poses are copies of the last observed pose, so that the estimation only focuses on forecasting the difference between each future pose and the final observed pose.
  • Figure 5: Temporal channel mechanism to exploit relationships of $P$ skeleton parameters between $N$ frames. Attention is used to capture time dependencies in the projected embedding space. To simplify the visualization, only a history of $T=2$ poses ($x_1$ and $x_2$) is used.
  • ...and 4 more figures
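
The input pattern of Figure 4 and the residual sum of Figure 3 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `build_padded_input` and `combine_channels`, the toy value `P = 3`, and the zero-valued stand-ins for the two channel outputs are all hypothetical. The sketch also assumes that $X$, $\hat{X}_S$ and $\hat{X}_T$ share one shape of $(N + T')$ poses by $P$ parameters. With $N=10$ frames covering 400 ms and $T'=25$ frames covering 1 second, the captions imply a rate of 25 frames per second.

```python
from typing import List

Pose = List[float]  # one skeleton pose flattened to P parameters (assumed layout)


def build_padded_input(observed: List[Pose], num_future: int) -> List[Pose]:
    """Return the observed poses followed by num_future copies of the last one."""
    if not observed:
        raise ValueError("need at least one observed pose")
    last_pose = observed[-1]
    padding = [list(last_pose) for _ in range(num_future)]
    return [list(pose) for pose in observed] + padding


def combine_channels(
    x: List[Pose], x_spatial: List[Pose], x_temporal: List[Pose]
) -> List[Pose]:
    """Sum the residual input and both channel outputs, element by element."""
    return [
        [xi + si + ti for xi, si, ti in zip(px, ps, pt)]
        for px, ps, pt in zip(x, x_spatial, x_temporal)
    ]


if __name__ == "__main__":
    N, T_FUTURE, P = 10, 25, 3  # N and T' from Figure 4; P = 3 is a toy value
    observed = [[float(t)] * P for t in range(N)]
    x = build_padded_input(observed, T_FUTURE)

    # Hypothetical stand-ins for the spatial and temporal channel outputs.
    zeros = [[0.0] * P for _ in x]
    x_pred = combine_channels(x, zeros, zeros)

    print(len(x), x_pred[-1])  # prints: 35 [9.0, 9.0, 9.0]
```

When both channels output zeros, the prediction reduces to repeating the last observed pose. The two channels therefore only need to model the offset from that pose, which is the motivation stated in the Figure 4 caption.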