SocialFormer: Social Interaction Modeling with Edge-enhanced Heterogeneous Graph Transformers for Trajectory Prediction

Zixu Wang, Zhigang Sun, Juergen Luettin, Lavdim Halilaj

TL;DR

SocialFormer addresses trajectory prediction for autonomous driving by modeling rich social interactions and road topology. It introduces an edge-enhanced heterogeneous graph transformer (EHGT) that encodes edge attributes within a heterogeneous scene graph, coupled with a GRU-based temporal encoder and a four-part information fusion module that forms a comprehensive scene representation. A multimodal trajectory predictor samples a Gaussian latent variable $z$ to produce $k$ future trajectories $\hat{Y}_{1:t_f}^{k}$, complemented by a graph-based prediction $\tilde{Y}_{1:t_f}^{k}$; the training objective combines several loss terms. Experiments on the nuScenes benchmark demonstrate state-of-the-art accuracy, including robustness in scenes with sparse semantic relations, underscoring the value of explicit agent interactions and lane topology in real-world driving settings.
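As a rough illustration of the latent-variable sampling idea mentioned above, the following sketch draws one Gaussian latent vector per mode and decodes it, together with a scene encoding, into a trajectory of future waypoints. Everything here is an assumption for illustration: the function name `sample_trajectories`, the latent dimension, and the fixed random linear map standing in for a decoder are hypothetical and are not taken from the paper, whose predictor is a learned network.

```python
import random


def sample_trajectories(scene_encoding, k, t_f, latent_dim=2, seed=0):
    """Return k candidate futures, each a list of t_f (x, y) waypoints.

    Hypothetical decoder: a fixed random linear map from the concatenation
    [scene_encoding; z] to 2 * t_f outputs. A real predictor would be learned.
    """
    rng = random.Random(seed)
    in_dim = len(scene_encoding) + latent_dim
    out_dim = 2 * t_f
    # Stand-in for learned decoder weights (assumption, not from the paper).
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    trajectories = []
    for _ in range(k):
        # One draw of the latent variable z ~ N(0, I) per predicted mode.
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        features = list(scene_encoding) + z
        flat = [sum(w * f for w, f in zip(row, features)) for row in weights]
        # Reshape the flat output into t_f (x, y) waypoints.
        trajectories.append([(flat[2 * t], flat[2 * t + 1])
                             for t in range(t_f)])
    return trajectories


paths = sample_trajectories(scene_encoding=[0.5, -1.0, 0.25], k=5, t_f=12)
print(len(paths), len(paths[0]))  # prints "5 12"
```

Because each mode uses a different draw of $z$, the same scene encoding yields $k$ distinct candidate futures, which is the basic mechanism behind multimodal prediction with a sampled latent variable.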

Abstract

Accurate trajectory prediction is crucial for ensuring safe and efficient autonomous driving. However, most existing methods overlook complex interactions between traffic participants that often govern their future trajectories. In this paper, we propose SocialFormer, an agent interaction-aware trajectory prediction method that leverages the semantic relationship between the target vehicle and surrounding vehicles by making use of the road topology. We also introduce an edge-enhanced heterogeneous graph transformer (EHGT) as the aggregator in a graph neural network (GNN) to encode the semantic and spatial agent interaction information. Additionally, we introduce a temporal encoder based on gated recurrent units (GRU) to model the temporal social behavior of agent movements. Finally, we present an information fusion framework that integrates agent encoding, lane encoding, and agent interaction encoding for a holistic representation of the traffic scene. We evaluate SocialFormer for the trajectory prediction task on the popular nuScenes benchmark and achieve state-of-the-art performance.
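To make the GRU-based temporal encoder more concrete, here is a minimal sketch of a single GRU cell run over a short sequence of per-timestep feature vectors, returning the final hidden state as the temporal summary. This is a generic, bias-free GRU with random weights written in plain Python; the class name `GRUCell`, the dimensions, and the update convention $h_t = (1 - z_t)\,h_{t-1} + z_t\,\tilde{h}_t$ are assumptions for illustration (some libraries swap the roles of $z_t$ and $1 - z_t$), and this is not the authors' implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class GRUCell:
    """Bias-free GRU cell with random weights (illustrative only)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.hidden_dim = hidden_dim
        self.W_z, self.U_z = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim)
        self.W_r, self.U_r = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim)
        self.W_h, self.U_h = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim)

    def step(self, x, h):
        # Update gate z and reset gate r.
        z = [sigmoid(a + b)
             for a, b in zip(matvec(self.W_z, x), matvec(self.U_z, h))]
        r = [sigmoid(a + b)
             for a, b in zip(matvec(self.W_r, x), matvec(self.U_r, h))]
        # Candidate state uses the reset-gated previous hidden state.
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.W_h, x), matvec(self.U_h, gated_h))]
        # Interpolate between the previous state and the candidate.
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def encode_sequence(cell, frames):
    """Run the cell over all frames and return the last hidden state."""
    h = [0.0] * cell.hidden_dim
    for x in frames:
        h = cell.step(x, h)
    return h


# Three timesteps of hypothetical per-agent features, e.g. (x, y, speed).
frames = [[0.0, 0.0, 1.0], [0.1, 0.0, 1.1], [0.2, 0.1, 1.2]]
summary = encode_sequence(GRUCell(input_dim=3, hidden_dim=4), frames)
print(len(summary))  # prints "4"
```

In the setting the abstract describes, the inputs at each timestep would be the interaction encodings produced for that frame rather than the raw features used here.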

Paper Structure (27 sections, 19 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Schematic illustration of the traffic scene. The future trajectory of a vehicle depends on many factors, including agent states, road topology with possible driving directions and lane changes, and the influence of nearby vehicles. The latter includes past trajectory, relation type, distance, speed, and right of way, among others.
  • Figure 2: Overview of the proposed SocialFormer. Agent states, lane graphs, and agent interactions (represented as heterogeneous graphs) are each encoded by a dedicated encoder. An information fusion module then generates a holistic latent representation of the traffic scene. Finally, the predictor outputs possible future trajectories of the target agent.
  • Figure 3: The Agent Interaction Dynamic Heterogeneous Graph Encoder, which consists of three parts: a type-specific encoder, the EHGT, and a temporal encoder.
  • Figure 4: Information Fusion Module: Four sub-modules combine different encodings to form a comprehensive encoding for the predictor. The output is $f_{fused}$.
  • Figure 5: Qualitative results in various traffic scenarios. Left column: HD maps and tracks. Middle column: Top 10 most likely predictions. Right column: Ground truth.