PGformer: Proxy-Bridged Game Transformer for Multi-Person Highly Interactive Extreme Motion Prediction

Yanwen Fang, Jintai Chen, Peng-Tao Jiang, Chao Li, Yifeng Geng, Eddy K. F. Lam, Guodong Li

TL;DR

This work targets multi-person pose forecasting under highly interactive extreme motions and introduces PGformer, a Transformer-based model with a cross-query attention (XQA) module and a proxy bridge that capture bidirectional dependencies between two interacting individuals. XQA lets the leader and follower poses share a unified attention mechanism, while the proxy aggregates spatial semantics to steer the bidirectional flow of information between the two persons, improving both short- and long-term predictions. PGformer employs a non-autoregressive decoder, a DCT-based pose encoder, a GCN-based decoder, and a gravity loss that stabilizes the center-of-gravity dynamics. It reports strong results on the ExPI dataset and generalizes to CMU-Mocap and MuPoTS-3D. The method is aimed at multi-person motion prediction in highly interactive real-world scenarios, with potential applications in robotics, autonomous systems, and human-robot interaction.
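To make the "DCT-based pose encoder" mentioned above more concrete, the following is a minimal sketch of how a pose trajectory can be moved into the frequency domain with a discrete cosine transform along the time axis. It is not the authors' implementation: the function names, the array shapes (T frames by D flattened joint coordinates), the example sizes, and the choice to keep only the lowest-frequency coefficients are all assumptions made for illustration.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis of shape (n, n); row k is the k-th frequency."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    basis = np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] *= np.sqrt(1.0 / n)   # scaling that makes the rows orthonormal
    basis[1:] *= np.sqrt(2.0 / n)
    return basis

def encode_trajectory(poses: np.ndarray, n_coeffs: int) -> np.ndarray:
    """Map poses (T, D) to the n_coeffs lowest-frequency DCT coefficients (n_coeffs, D)."""
    n_frames = poses.shape[0]
    return dct_matrix(n_frames)[:n_coeffs] @ poses

def decode_trajectory(coeffs: np.ndarray, n_frames: int) -> np.ndarray:
    """Inverse of encode_trajectory; exact only when no coefficients were dropped."""
    n_coeffs = coeffs.shape[0]
    return dct_matrix(n_frames)[:n_coeffs].T @ coeffs

# Hypothetical sizes: 50 frames, 18 joints with 3 coordinates each.
rng = np.random.default_rng(0)
poses = rng.normal(size=(50, 18 * 3))
coeffs = encode_trajectory(poses, n_coeffs=20)   # compact frequency representation
smoothed = decode_trajectory(coeffs, n_frames=50)  # low-pass reconstruction
```

Dropping high-frequency coefficients yields a smoothed trajectory, which is one common reason for working in DCT space; whether PGformer truncates coefficients is not stated here.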

Abstract

Multi-person motion prediction is a challenging task, especially in real-world scenarios involving highly interactive persons. Most previous work has studied weak interactions (e.g., walking together), where forecasting each person's pose in isolation can still achieve good performance. This paper focuses on collaborative motion prediction for multiple persons performing extreme motions and explores the relationships between the pose trajectories of highly interactive persons. Specifically, a novel cross-query attention (XQA) module, tailored for this situation, is proposed to bilaterally learn the cross-dependencies between the two pose sequences. A proxy unit is additionally introduced to bridge the involved persons; it cooperates with the proposed XQA module and subtly controls the bidirectional spatial information flows. These designs are integrated into a Transformer-based architecture, and the resulting model is called Proxy-bridged Game Transformer (PGformer) for multi-person interactive motion prediction. Its effectiveness is evaluated on the challenging ExPI dataset, which involves highly interactive actions. PGformer consistently outperforms state-of-the-art methods in both short- and long-term predictions by a large margin. In addition, the approach is compatible with the weakly interactive CMU-Mocap and MuPoTS-3D datasets and extends to more than two individuals with encouraging results.
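The idea of two pose sequences attending to each other through a shared bridge can be illustrated with a generic bidirectional cross-attention. The sketch below is not the paper's XQA formulation, which is not specified in this summary: it uses plain scaled dot-product attention without learned projections, and the proxy is modeled as a few extra context rows visible to both persons. All names, shapes, and the residual connection are assumptions.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_feats: np.ndarray, context_feats: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention with queries from one person and
    keys/values taken directly from the context (no learned projections)."""
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context_feats

def bidirectional_cross_attention(leader, follower, proxy):
    """leader, follower: (T, d) features; proxy: (P, d) shared rows (hypothetical).

    Each person attends to the other person's features plus the shared proxy,
    and the result is added back to that person's own features.
    """
    ctx_for_leader = np.concatenate([follower, proxy], axis=0)
    ctx_for_follower = np.concatenate([leader, proxy], axis=0)
    leader_out = leader + cross_attend(leader, ctx_for_leader)
    follower_out = follower + cross_attend(follower, ctx_for_follower)
    return leader_out, follower_out

# Hypothetical sizes: 10 time steps, 16 feature channels, 2 proxy rows.
rng = np.random.default_rng(0)
leader = rng.normal(size=(10, 16))
follower = rng.normal(size=(10, 16))
proxy = rng.normal(size=(2, 16))
leader_out, follower_out = bidirectional_cross_attention(leader, follower, proxy)
```

Because both directions see the same proxy rows, information that is common to the pair has a single shared channel, which is the intuition behind a "bridge"; a trained model would add learned query, key, and value projections.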

Paper Structure

This paper contains 34 sections, 10 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Weakly interacted motions vs highly interacted extreme motions.
  • Figure 2: Illustration of the cross-query attention (XQA) module with a proxy, where 'SM' and 'Matmul' denote softmax and matrix multiplication, respectively. ⓒ denotes channel-wise concatenation.
  • Figure 3: Overview of the PGformer architecture for multi-person highly interactive extreme motion prediction. $\oplus$ and ⓒ denote broadcast element-wise addition and concatenation, respectively, and PE denotes positional encoding. T denotes the template matrix used to construct the proxy in the encoder layer; the proxy in the decoder layer is built from the predicted future templates. The bottom left shows a schematic of a PGformer layer, comprising a standard Transformer layer (MHA + FFN) followed by an XQA module with a proxy.
  • Figure 4: Percentage improvement of PGformer over other methods at different forecasting times on the common action split, computed as the mean of the percentage improvements in average JME and average AME.
  • Figure 5: Average joint-wise JME gains over XIA and HRI on ExPI. Darker colors indicate larger gains.
  • ...and 6 more figures
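
The Figure 4 caption describes a score obtained by averaging the percentage improvements in JME and AME. A minimal sketch of that arithmetic follows, assuming "improvement" means relative error reduction with respect to the baseline; the caption does not give the exact formula, and the function names and the numbers in the usage lines are hypothetical, not results from the paper.

```python
def improvement_pct(baseline_err: float, ours_err: float) -> float:
    """Relative error reduction in percent (assumed definition)."""
    return 100.0 * (baseline_err - ours_err) / baseline_err

def mean_improvement(baseline_jme: float, ours_jme: float,
                     baseline_ame: float, ours_ame: float) -> float:
    """Mean of the JME and AME percentage improvements."""
    jme_gain = improvement_pct(baseline_jme, ours_jme)
    ame_gain = improvement_pct(baseline_ame, ours_ame)
    return 0.5 * (jme_gain + ame_gain)

# Hypothetical errors: JME drops from 100 to 90 (10%), AME from 50 to 40 (20%).
score = mean_improvement(100.0, 90.0, 50.0, 40.0)  # mean of 10% and 20%
```

Averaging the two percentages rather than the raw errors keeps either metric from dominating when their scales differ.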