STPOTR: Simultaneous Human Trajectory and Pose Prediction Using a Non-Autoregressive Transformer for Robot Following Ahead

Mohammad Mahdavian, Payam Nikdel, Mahdi TaherAhmadi, Mo Chen

TL;DR

The paper tackles robot follow-ahead by predicting both future human 3D body pose and hip trajectory from observed motion. It introduces a non-autoregressive transformer with two parallel prediction heads for pose and trajectory, augmented by a Shared Attention module and an End Attention mechanism to strengthen cross-task learning and temporal modeling. Empirical results on Human3.6M show competitive pose accuracy and improved trajectory prediction, while achieving faster inference suitable for real-time robotics, demonstrated in real-world follow-ahead experiments with a ZED2 camera and a Turtlebot2. The work demonstrates that jointly modeling pose and trajectory improves performance and enables richer follow behaviors, with ablations validating the contribution of the shared and end-attention components.

Abstract

In this paper, we develop a neural network model to predict future human motion from an observed human motion history. We propose a non-autoregressive transformer architecture to leverage its parallel nature for easier training and fast, accurate predictions at test time. The proposed architecture divides human motion prediction into two parts: 1) the human trajectory, which is the 3D position of the hip joint over time, and 2) the human pose, which is the 3D positions of all other joints over time with respect to a fixed hip joint. We propose to make the two predictions simultaneously, as the shared representation can improve the model performance. Therefore, the model consists of two sets of encoders and decoders, complemented by two attention modules. First, a multi-head attention module applied to encoder outputs improves human trajectory prediction. Second, another multi-head self-attention module applied to encoder outputs concatenated with decoder outputs facilitates learning of temporal dependencies. Our model is well-suited for robotic applications in terms of test accuracy and speed, and compares favorably with state-of-the-art methods. We demonstrate the real-world applicability of our work via the Robot Follow-Ahead task, a challenging yet practical case study for our proposed model.
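
The trajectory/pose decomposition described in the abstract can be illustrated with a short sketch. This is not the authors' code: the names `split_motion` and `merge_motion`, the nested-list data layout, and the choice of index 0 for the hip joint are all assumptions made for illustration.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Frame = List[Vec3]  # one 3D position per joint

HIP = 0  # assumption: the hip (root) joint is stored at index 0


def split_motion(frames: List[Frame], hip: int = HIP) -> Tuple[List[Vec3], List[Frame]]:
    """Split a motion sequence into a hip trajectory and hip-relative poses."""
    trajectory: List[Vec3] = []
    poses: List[Frame] = []
    for frame in frames:
        hx, hy, hz = frame[hip]
        # Trajectory: the global 3D hip position of this frame.
        trajectory.append((hx, hy, hz))
        # Pose: every non-hip joint expressed relative to the hip of the same frame.
        poses.append([
            (x - hx, y - hy, z - hz)
            for j, (x, y, z) in enumerate(frame)
            if j != hip
        ])
    return trajectory, poses


def merge_motion(trajectory: List[Vec3], poses: List[Frame], hip: int = HIP) -> List[Frame]:
    """Inverse of split_motion: rebuild global joint positions."""
    frames: List[Frame] = []
    for (hx, hy, hz), pose in zip(trajectory, poses):
        joints = [(x + hx, y + hy, z + hz) for (x, y, z) in pose]
        joints.insert(hip, (hx, hy, hz))  # put the hip back at its original index
        frames.append(joints)
    return frames
```

Under these assumptions, the two predicted streams can be recombined with `merge_motion` to recover full-body motion in global coordinates.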

Paper Structure

This paper contains 21 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Robot follow-ahead via human motion prediction
  • Figure 2: Our model architecture predicts both human poses and trajectories concurrently based on an observed 3D human joint sequence. It consists of two non-autoregressive transformers for pose and trajectory predictions, with a Shared Attention module that enhances the quality of predictions by facilitating the exchange of knowledge between the two. To model the temporal dependencies more effectively, an End Attention module is added to the end of each decoder. The blue-colored frames show the input sequence or frame and the red ones show the output. The rectangular frames show that the same frame (the last input pose) is copied and used both as the decoder input sequence and as a residual for the decoder output.
  • Figure 3: Three samples of the predicted motion vs. ground truth. In each pair of panels (a to c), the left one shows the predicted motion given an observed sequence and the right one shows the ground truth. The blue-colored skeletons show the input sequence, and the red and green ones show the model predictions and ground truth, respectively. The trajectory of the hip is shown with dashed blue lines.
  • Figure 4: Three samples of the robot follow-ahead task for U-shaped, S-shaped, and straight-line scenarios. The triangle and arrows show the human and robot motions, respectively.
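
The Figure 2 caption notes that the last input pose is copied to form the decoder input sequence and is also added as a residual to the decoder output. A minimal sketch of that idea follows. The `decoder` argument is a hypothetical stand-in for a trained network, and the function names and flattened-list pose format are assumptions rather than the paper's implementation.

```python
from typing import Callable, List

Pose = List[float]  # flattened joint coordinates for one frame


def seed_queries(observed: List[Pose], horizon: int) -> List[Pose]:
    """Copy the last observed pose `horizon` times to form the decoder input."""
    last = observed[-1]
    return [list(last) for _ in range(horizon)]


def predict(
    observed: List[Pose],
    horizon: int,
    decoder: Callable[[List[Pose]], List[Pose]],
) -> List[Pose]:
    """Single parallel pass: no predicted frame is fed back as input."""
    queries = seed_queries(observed, horizon)
    # Hypothetical decoder: maps all query frames to per-frame offsets at once.
    offsets = decoder(queries)
    # Residual connection: the output is the last observed pose plus the offset.
    return [
        [q + o for q, o in zip(query, offset)]
        for query, offset in zip(queries, offsets)
    ]
```

Because every future frame is produced in one call to `decoder`, inference cost does not grow with a frame-by-frame feedback loop, which is the property the non-autoregressive design relies on for speed.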