Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning
Jaewoo Jeong, Daehee Park, Kuk-Jin Yoon
TL;DR
This work tackles long-term multi-agent 3D human pose forecasting by decoupling global trajectories from local poses: multiple global trajectory modes are predicted first, and local poses are then forecast conditioned on each mode. The proposed Trajectory2Pose (T2P) framework uses an interaction-aware Traj-Pose module that models inter-agent relations efficiently via graph attention, enabling reciprocal refinement of global intent and local motion. A multi-stage training objective combines $L_2$ losses on both global trajectories and local poses. The authors also introduce JRDB-GMP, a new dataset with up to 24 agents over 5 seconds, to evaluate performance in realistic scenarios. The method achieves state-of-the-art results on existing benchmarks and on the new dataset, demonstrating generalization to complex real-world multi-agent interactions and offering a practical route toward long-horizon human motion understanding in safety-critical applications.
Abstract
Human pose forecasting garners attention for its diverse applications. However, challenges in modeling the multi-modal nature of human motion and intricate interactions among agents persist, particularly with longer timescales and more agents. In this paper, we propose an interaction-aware trajectory-conditioned long-term multi-agent human pose forecasting model, utilizing a coarse-to-fine prediction approach: multi-modal global trajectories are initially forecasted, followed by respective local pose forecasts conditioned on each mode. In doing so, our Trajectory2Pose model introduces a graph-based agent-wise interaction module for a reciprocal forecast of local motion-conditioned global trajectory and trajectory-conditioned local pose. Our model effectively handles the multi-modality of human motion and the complexity of long-term multi-agent interactions, improving performance in complex environments. Furthermore, we address the lack of long-term (6s+) multi-agent (5+) datasets by constructing a new dataset from real-world images and 2D annotations, enabling a comprehensive evaluation of our proposed model. State-of-the-art prediction performance on both complex and simpler datasets confirms the generalized effectiveness of our method. The code is available at https://github.com/Jaewoo97/T2P.
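To make the coarse-to-fine idea concrete, the following is a minimal NumPy sketch of how a world-frame pose sequence might be split into a global trajectory and root-relative local poses, and how $L_2$ terms on both could be combined. It is an illustration under stated assumptions, not the T2P implementation: the function names, the choice of joint 0 as the root, the loss weights, and the selection of the mode closest to the ground-truth trajectory are all hypothetical and are not taken from the paper.

```python
import numpy as np


def decouple(poses, root=0):
    """Split world-frame poses (T, J, 3) into a global trajectory (T, 3)
    and root-relative local poses (T, J, 3).
    Assumption: the global trajectory is the path of a single root joint."""
    traj = poses[:, root, :]
    local = poses - traj[:, None, :]
    return traj, local


def recompose(traj, local):
    """Inverse of decouple: add the trajectory back onto the local poses."""
    return local + traj[:, None, :]


def l2_loss(pred, target):
    """Mean Euclidean distance over the last (xyz) axis."""
    return float(np.linalg.norm(pred - target, axis=-1).mean())


def combined_loss(pred_trajs, pred_locals, gt_traj, gt_local,
                  w_traj=1.0, w_pose=1.0):
    """Combine L2 terms on trajectory and local pose.
    pred_trajs: (K, T, 3) trajectory modes; pred_locals: (K, T, J, 3).
    Assumption: only the mode whose trajectory is closest to the ground
    truth is penalised; the weights are placeholders."""
    traj_err = np.linalg.norm(pred_trajs - gt_traj[None], axis=-1).mean(axis=-1)
    k = int(np.argmin(traj_err))  # index of the best trajectory mode
    loss = w_traj * float(traj_err[k]) + w_pose * l2_loss(pred_locals[k], gt_local)
    return loss, k
```

Here `decouple` and `recompose` are exact inverses, so a model that forecasts the two components separately can always be evaluated in world coordinates by recomposing its outputs.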
