SceneMotion: From Agent-Centric Embeddings to Scene-Wide Forecasts

Royden Wagner, Ömer Sahin Tas, Marlon Steiner, Fabian Konstantinidis, Hendrik Königshof, Marvin Klemp, Carlos Fernandez, Christoph Stiller

TL;DR

SceneMotion tackles scene-wide forecasting of joint trajectories for multiple traffic agents in driving environments by transforming local agent-centric embeddings into a global scene-wide latent space and decoding this into six joint motion modes. The method leverages an attention-based latent context module and an anchor-based decoder to capture interactions among up to eight focal agents with dense context, achieving strong performance on the Waymo Open Motion and Interaction Prediction benchmarks, while offering a waypoint-clustering analysis to quantify inter-agent interactions. Key contributions include data-efficient agent-centric representations, a global latent context for joint forecasting, and a quantitative interpretability tool that assesses whether predicted interactions resolve potential conflicts. The approach has practical impact for planning in autonomous driving, as it provides both accurate scene-wide forecasts and a mechanism to identify and reason about potential future interactions.

Abstract

Self-driving vehicles rely on multimodal motion forecasts to effectively interact with their environment and plan safe maneuvers. We introduce SceneMotion, an attention-based model for forecasting scene-wide motion modes of multiple traffic agents. Our model transforms local agent-centric embeddings into scene-wide forecasts using a novel latent context module. This module learns a scene-wide latent space from multiple agent-centric embeddings, enabling joint forecasting and interaction modeling. The competitive performance in the Waymo Open Interaction Prediction Challenge demonstrates the effectiveness of our approach. Moreover, we cluster future waypoints in time and space to quantify the interaction between agents. We merge all modes and analyze each mode independently to determine which clusters are resolved through interaction or result in conflict. Our implementation is available at: https://github.com/kit-mrt/future-motion
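
The waypoint analysis mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' clustering procedure: it only flags pairs of agents whose forecast waypoints in one joint mode come close in both space and time. The function name `conflicting_pairs`, the thresholds `dist_thresh` and `time_thresh`, and the 0.1 s sampling step `dt` are all assumptions made for illustration.

```python
import numpy as np

def conflicting_pairs(trajs, dt=0.1, dist_thresh=2.0, time_thresh=1.0):
    """Flag agent pairs with waypoints that are close in space and time.

    trajs: array of shape (n_agents, n_steps, 2) holding x/y waypoints of
    one joint motion mode. All thresholds are illustrative assumptions.
    """
    n_agents, n_steps, _ = trajs.shape
    times = np.arange(n_steps) * dt
    # Time gap between every pair of timesteps, shape (n_steps, n_steps).
    time_gap = np.abs(times[:, None] - times[None, :])
    pairs = []
    for i in range(n_agents):
        for j in range(i + 1, n_agents):
            # Distance between every waypoint of agent i and every waypoint of agent j.
            diff = trajs[i][:, None, :] - trajs[j][None, :, :]
            dist = np.linalg.norm(diff, axis=-1)
            # A pair is flagged if any two waypoints are near in space AND time.
            if np.any((dist < dist_thresh) & (time_gap < time_thresh)):
                pairs.append((i, j))
    return pairs
```

Under these assumptions, two agents that cross the same point at the same time are flagged, while two agents that pass the same point several seconds apart are not. This loosely mirrors the distinction between clusters that end in conflict and clusters that are resolved through interaction.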

Paper Structure

This paper contains 6 sections, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: SceneMotion. Our attention-based motion forecasting model is composed of stacked encoder and decoder modules. Variable-sized agent-centric views $V_i$ are reduced to fixed-sized agent-centric embeddings $E_i$ via cross-attention with road environment descriptor (RED) tokens $R_j$. Afterwards, we concatenate the agent-centric embeddings with global reference tokens $G_i$ and rearrange them to form a scene-wide embedding. Our latent context module then learns global context and our motion decoder transforms learned anchors $A_k$ into scene-wide forecasts. We show a simplified example with only two focal agents. By default, our model forecasts motion for 8 focal agents each with 48 context agents, enabling interaction modeling in complex scenarios.
  • Figure 2: Scene-wide motion forecasts. Our model forecasts scene-wide motion modes by modeling joint distributions of trajectories for 8 focal agents. Dynamic agents are shown in blue, static agents in grey (determined at $t=0\,\text{s}$). Lanes are black lines and road markings are white lines.
  • Figure 3: Analyzing scene-wide motion forecasts in terms of waypoint clusters. We show an example of a potential interaction of widely separated vehicles to demonstrate the benefits of a long prediction horizon of 8 s.
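
The data flow described in the Figure 1 caption can be sketched at the level of array shapes. This is a simplified illustration, not the released implementation: it uses single-head attention without learned projections, and it assumes that each global reference token is appended along the token axis before the per-agent embeddings are flattened into one scene-wide sequence. The names `cross_attend` and `scene_embedding` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention without learned projections (a simplification)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_queries, n_kv)
    return softmax(scores, axis=-1) @ keys_values   # (n_queries, d)

def scene_embedding(views, red_tokens, ref_tokens):
    """Build a scene-wide token sequence from variable-sized agent-centric views.

    views:      list of arrays V_i with shape (n_i, d), n_i may differ per agent
    red_tokens: array R with shape (n_red, d), used as attention queries
    ref_tokens: array G with shape (n_agents, d), one global reference per agent
    """
    per_agent = []
    for view, ref in zip(views, ref_tokens):
        emb = cross_attend(red_tokens, view)  # fixed-size E_i: (n_red, d)
        # Assumption: the reference token is appended along the token axis.
        per_agent.append(np.concatenate([emb, ref[None, :]], axis=0))
    stacked = np.stack(per_agent)             # (n_agents, n_red + 1, d)
    # Rearrange into one scene-wide sequence of tokens.
    return stacked.reshape(-1, stacked.shape[-1])
```

Because the RED tokens act as queries, the output size per agent depends only on the number of RED tokens, not on how many elements each view contains. That is what allows views of different sizes to be stacked into one scene-wide embedding.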