Poly-Autoregressive Prediction for Modeling Interactions

Neerja Thakkar, Tara Sadjadpour, Jathushan Rajasegaran, Shiry Ginosar, Jitendra Malik

TL;DR

The paper addresses the challenge of predicting agent behavior in multi-agent physical settings where interactions are governed by physics and internal motivations. It introduces Poly-Autoregressive (PAR) modeling, a transformer-based framework that predicts an ego agent's future states conditioned on its own history and the histories of other agents, by representing the scene as a sequence of tokens across all agents and timesteps. Across three real-world tasks (social action forecasting on AVA, multi-agent car trajectory prediction on nuScenes, and hand-object pose forecasting on DexYCB), PAR consistently outperforms single-agent autoregressive baselines: it improves mAP on two-person actions, reduces ADE/FDE on trajectories, and reduces rotation and translation errors on poses. The results suggest that incorporating multi-agent context in a unified, simple framework can substantially improve predictive accuracy in diverse interactive domains, with broad implications for navigation, human-robot interaction, and autonomous systems.

Abstract

We introduce a simple framework for predicting the behavior of an agent in multi-agent settings. In contrast to autoregressive (AR) tasks, such as language processing, our focus is on scenarios with multiple agents whose interactions are shaped by physical constraints and internal motivations. To this end, we propose Poly-Autoregressive (PAR) modeling, which forecasts an ego agent's future behavior by reasoning about the ego agent's state history and the past and current states of other interacting agents. At its core, PAR represents the behavior of all agents as a sequence of tokens, each representing an agent's state at a specific timestep. With minimal data pre-processing changes, we show that PAR can be applied to three different problems: human action forecasting in social situations, trajectory prediction for autonomous vehicles, and object pose forecasting during hand-object interaction. Using a small proof-of-concept transformer backbone, PAR outperforms AR across these three scenarios. The project website can be found at https://neerja.me/PAR/.
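
As a rough illustration of the token-sequence idea described in the abstract, the sketch below flattens per-agent state tokens into a single sequence. The timestep-major ordering, the function name `flatten_scene`, and the toy integer tokens are assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Sequence


def flatten_scene(states: Sequence[Sequence[int]]) -> List[int]:
    """Flatten per-agent token histories into one sequence.

    states[a][t] is the (hypothetical) token for agent `a` at timestep `t`.
    The output is ordered timestep by timestep, and agent by agent within
    each timestep. This ordering is an assumption of the sketch.
    """
    if not states:
        return []
    num_steps = len(states[0])
    if any(len(agent_states) != num_steps for agent_states in states):
        raise ValueError("all agents must have the same number of timesteps")
    return [
        states[agent][step]
        for step in range(num_steps)
        for agent in range(len(states))
    ]


# Two agents, three timesteps. Tokens 1x belong to agent 0, 2x to agent 1.
scene = [[10, 11, 12], [20, 21, 22]]
tokens = flatten_scene(scene)
print(tokens)  # [10, 20, 11, 21, 12, 22]
```

Under this ordering, the tokens of any one agent sit a fixed stride apart (the number of agents), so a model can attend to every agent's past while still locating the ego agent's own history.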

Paper Structure

This paper contains 31 sections, 3 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 2: The PAR Framework. We begin by collecting a video dataset, such as AVA (top) or DexYCB (bottom). Then, using dataset labels or computer vision techniques, a trajectory of a given modality for our prediction task is extracted for each agent, such as action class labels (top) or object pose and 3D hand translation (bottom). Data is then tokenized, either through discretization or directly using continuous values, with our framework supporting both formats. Based on the tokenization and prediction task, we choose the appropriate loss function for PAR training. After training with PAR, predicted tokens can be decoded back to data space and evaluated with relevant metrics.
  • Figure 3: Training with teacher forcing for (a) multi-agent next-token prediction in autoregressive models and (b) multi-agent poly-autoregressive models. Solid vs striped indicates a ground-truth vs predicted token, respectively. Color denotes agent identity. The AR model is trained for next-token prediction, while the PAR model is trained to predict the next timestep of the same agent. Three agents are shown for ease of visualization, but the PAR model supports an arbitrary number of agents.
  • Figure 4: Action forecasting example. The distribution over ground-truth actions is shown in white, and our predictions in red. A 6s action history (1Hz) is input, and 6s of future actions are predicted. In the scene, the man and woman alternate between talking and listening. Initially, the man is talking. The AR model predicts he will continue talking, while the 2-agent PAR model recognizes the woman is talking and predicts more accurate turn-taking behavior.
  • Figure 5: Per-class mAP for AVA 2-person actions. We see performance improvement on almost all 2-person AVA action classes ((P) stands for "a person"). Some absolute mAP gains are particularly significant: listen to $+7.0$, kiss $+8.3$, fight/hit $+5.7$, talk to $+4.4$, hug $+5.7$, and hand shake $+4.0$.
  • Figure 6: Example results from our single-agent AR model (top row) and three-agent PAR model with location positional encoding (bottom row) on nuScenes. The predicted agent's ground truth trajectory is in pink, and the predicted future in blue. For the PAR model, the other two agents' ground truth states are in green. Qualitatively, the PAR model handles situations where single-agent predictions might lead to collisions (A, B), uses other agents' behavior to better adhere to road areas (A, C) without environment data, and predicts based on the speed changes of other cars (D).
  • ...and 7 more figures
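
The difference between the two training targets described in the Figure 3 caption can be sketched on a toy token sequence. This is a minimal illustration, not the authors' implementation: it assumes the tokens are interleaved timestep by timestep with a fixed number of agents, and the helper names `ar_pairs` and `par_pairs` are hypothetical.

```python
from typing import List, Sequence, Tuple


def ar_pairs(tokens: Sequence[int]) -> List[Tuple[int, int]]:
    """(input, target) pairs for plain next-token prediction."""
    return list(zip(tokens[:-1], tokens[1:]))


def par_pairs(tokens: Sequence[int], num_agents: int) -> List[Tuple[int, int]]:
    """(input, target) pairs where the target is the same agent's next timestep.

    Assumes timestep-major interleaving, so the same agent reappears
    exactly `num_agents` positions later in the sequence.
    """
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")
    return list(zip(tokens[:-num_agents], tokens[num_agents:]))


# Three agents, three timesteps. Tokens 1x, 2x, 3x belong to agents 0, 1, 2.
tokens = [10, 20, 30, 11, 21, 31, 12, 22, 32]

print(ar_pairs(tokens)[:3])      # [(10, 20), (20, 30), (30, 11)]
print(par_pairs(tokens, 3)[:3])  # [(10, 11), (20, 21), (30, 31)]
```

With a single agent, the stride is 1 and the poly-autoregressive pairing reduces to ordinary next-token prediction.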