PGformer: Proxy-Bridged Game Transformer for Multi-Person Highly Interactive Extreme Motion Prediction
Yanwen Fang, Jintai Chen, Peng-Tao Jiang, Chao Li, Yifeng Geng, Eddy K. F. Lam, Guodong Li
TL;DR
This work targets multi-person pose forecasting under highly interactive extreme motions and introduces PGformer, a Transformer-based model enhanced with a cross-query attention (XQA) module and a proxy bridge to capture bidirectional dependencies between two interacting individuals. The XQA module lets the leader and follower poses share a unified attention mechanism, while the proxy aggregates spatial semantics and subtly steers the information flow between the two persons, improving both short- and long-term predictions. PGformer employs a non-autoregressive decoder, a DCT-based pose encoder, a GCN-based decoder, and a gravity loss to stabilize the center-of-gravity dynamics. It shows strong results on the ExPI dataset and generalizes well to CMU-Mocap and MuPoTS-3D. The method is aimed at real-world, highly interactive scenarios, with potential applications in robotics, autonomous systems, and human-robot interaction.
Abstract
Multi-person motion prediction is a challenging task, especially in real-world scenarios involving highly interactive persons. Most previous works have studied the case of weak interactions (e.g., walking together), in which forecasting each human pose in isolation can typically still achieve good performance. This paper focuses on collaborative motion prediction for multiple persons with extreme motions and explores the relationships between the pose trajectories of highly interactive persons. Specifically, a novel cross-query attention (XQA) module, tailored to this situation, is proposed to bilaterally learn the cross-dependencies between the two pose sequences. A proxy unit is additionally introduced to bridge the involved persons; it cooperates with the proposed XQA module and subtly controls the bidirectional spatial information flows. These designs are then integrated into a Transformer-based architecture, and the resulting model is called Proxy-bridged Game Transformer (PGformer) for multi-person interactive motion prediction. Its effectiveness has been evaluated on the challenging ExPI dataset, which involves highly interactive actions. PGformer consistently outperforms the state-of-the-art methods in both short- and long-term predictions by a large margin. In addition, our approach is compatible with the weakly interactive CMU-Mocap and MuPoTS-3D datasets and can be extended to the case of more than two individuals with encouraging results.
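To make the idea of bilaterally learning cross-dependencies between two pose sequences more concrete, the following is a minimal sketch of generic bidirectional cross-attention in NumPy. It is not the authors' XQA module and omits the proxy unit entirely: it only illustrates two persons attending to each other with one shared set of projection weights. The function names, the weight matrices `w_q`, `w_k`, `w_v`, and all sizes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Queries come from one person, keys and values from the other.
    q = queries @ w_q                      # (T, H)
    k = context @ w_k                      # (T, H)
    v = context @ w_v                      # (T, H)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ v, weights

def bidirectional_cross_attention(pose_a, pose_b, w_q, w_k, w_v):
    # Assumption: both directions share the same projection weights,
    # a simplified stand-in for a "unified" attention mechanism.
    a_from_b, w_ab = cross_attention(pose_a, pose_b, w_q, w_k, w_v)
    b_from_a, w_ba = cross_attention(pose_b, pose_a, w_q, w_k, w_v)
    return a_from_b, b_from_a, w_ab, w_ba

# Illustrative sizes only: T frames, D flattened pose features, H hidden units.
T, D, H = 10, 54, 16
rng = np.random.default_rng(0)
pose_a = rng.standard_normal((T, D))       # hypothetical pose sequence, person A
pose_b = rng.standard_normal((T, D))       # hypothetical pose sequence, person B
w_q, w_k, w_v = (0.1 * rng.standard_normal((D, H)) for _ in range(3))

a_upd, b_upd, w_ab, w_ba = bidirectional_cross_attention(
    pose_a, pose_b, w_q, w_k, w_v
)
print(a_upd.shape, b_upd.shape)
```

In this sketch each person's updated representation is a weighted mixture of the other person's value vectors, so information flows in both directions within a single layer. A proxy unit, as described in the abstract, would add a further mechanism for controlling that flow; its exact form is not specified here and is therefore not sketched.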
