Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning

Jaewoo Jeong; Daehee Park; Kuk-Jin Yoon

TL;DR

This work tackles long-term multi-agent 3D human pose forecasting by decoupling global trajectories from local poses: it first predicts multiple global trajectory modes, then forecasts a local pose sequence conditioned on each mode. The proposed Trajectory2Pose (T2P) framework uses an interaction-aware Traj-Pose module that models inter-agent relations efficiently via graph attention, enabling reciprocal refinement of global intent and local motion. A multi-stage training objective combines $L_2$ losses on both global trajectories and local poses. The work also introduces a new dataset, JRDB-GMP, with up to 24 agents over 5 seconds, to evaluate performance in realistic scenarios. The method achieves state-of-the-art results on classic benchmarks and on the new dataset, indicating that it generalizes to complex real-world multi-agent interactions and is applicable to long-horizon human motion understanding and safety-critical settings.

Abstract

Human pose forecasting garners attention for its diverse applications. However, challenges in modeling the multi-modal nature of human motion and intricate interactions among agents persist, particularly with longer timescales and more agents. In this paper, we propose an interaction-aware trajectory-conditioned long-term multi-agent human pose forecasting model, utilizing a coarse-to-fine prediction approach: multi-modal global trajectories are initially forecasted, followed by respective local pose forecasts conditioned on each mode. In doing so, our Trajectory2Pose model introduces a graph-based agent-wise interaction module for a reciprocal forecast of local motion-conditioned global trajectory and trajectory-conditioned local pose. Our model effectively handles the multi-modality of human motion and the complexity of long-term multi-agent interactions, improving performance in complex environments. Furthermore, we address the lack of long-term (6s+) multi-agent (5+) datasets by constructing a new dataset from real-world images and 2D annotations, enabling a comprehensive evaluation of our proposed model. State-of-the-art prediction performance on both complex and simpler datasets confirms the generalized effectiveness of our method. The code is available at https://github.com/Jaewoo97/T2P.

Paper Structure (26 sections, 6 equations, 6 figures, 6 tables)

Introduction
Related works
Human pose forecasting
Trajectory prediction
Human pose estimation from image
Method
Problem definition
Overall framework
Model structure
Pose encoder
Trajectory module
Traj-pose module
Trajectory decoder
Pose decoder
Training objective
...and 11 more sections

Figures (6)

Figure 1: Human motion is goal-directed and influenced by other entities. Therefore, global intention contains hints for local intention, allowing us to infer local pose from global trajectories. Our method first forecasts global trajectories, upon which local poses are conditioned for subsequent forecasts. Pose and trajectory-wise inter-agent interactions are considered for both predictions.
Figure 2: Illustration of our T2P framework. We decompose global motion into global trajectory and local pose. Multi-modal global trajectory proposals are predicted from past global trajectory and local pose embeddings. Then, future local poses are conditioned and forecasted on each trajectory proposal to compose the final human pose prediction. Predicted local poses are added to their mode-specific global trajectories in a joint-wise manner, obtaining the global human poses as the final output.
Figure 3: Example scenes from the JRDB-GMP dataset, illustrating its long-term, multi-agent nature.
Figure 4: Various motions from the JRDB-MultiGlobPose dataset, providing rich motion cues for inter-agent interaction inference.
Figure 5: Visualization of a long-term forecasting scene from the JRDB-GMP (2/5) dataset. Past input poses are shown in the leftmost column, ground-truth (GT) future poses in the next, followed by forecasts from our model, MRT, TBIFormer, and JRT, respectively.
...and 1 more figures
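
The decomposition described in the Figure 2 caption (global motion split into a global trajectory and a local pose, then recombined joint-wise for each trajectory mode) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `decompose` and `compose`, the choice of joint 0 as the root, the root-relative definition of local pose, and the array shapes are all assumptions made for illustration, and the "forecasts" below are placeholders rather than model outputs.

```python
import numpy as np


def decompose(global_pose, root_joint=0):
    """Split global joint positions into a root trajectory and a local pose.

    global_pose: array of shape (T, J, 3) with T frames and J joints.
    Assumption: the trajectory is the position of one root joint, and the
    local pose is every joint expressed relative to that root.
    """
    trajectory = global_pose[:, root_joint, :]          # (T, 3)
    local_pose = global_pose - trajectory[:, None, :]   # (T, J, 3)
    return trajectory, local_pose


def compose(trajectories, local_poses):
    """Add each mode's trajectory to every joint of that mode's local pose.

    trajectories: (K, T, 3) for K trajectory modes.
    local_poses:  (K, T, J, 3), one local pose sequence per mode.
    Returns global poses of shape (K, T, J, 3).
    """
    return local_poses + trajectories[:, :, None, :]


# Toy data standing in for one agent's future motion.
rng = np.random.default_rng(0)
T, J, K = 4, 15, 3
gt_global = rng.normal(size=(T, J, 3))
traj, local = decompose(gt_global)

# Placeholder "forecasts": K trajectory modes made by shifting the true
# trajectory, each paired with the same local pose sequence.
mode_trajs = np.stack([traj + float(k) for k in range(K)])   # (K, T, 3)
mode_locals = np.stack([local for _ in range(K)])            # (K, T, J, 3)
global_modes = compose(mode_trajs, mode_locals)              # (K, T, J, 3)
```

In this sketch, mode 0 reproduces the original global pose exactly, while the other modes carry the same local motion along shifted trajectories, which is the sense in which a local pose forecast is attached to a mode-specific global trajectory.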
