Spatial and social situation-aware transformer-based trajectory prediction of autonomous systems

Kathrin Donandt, Dirk Söffker

TL;DR

The paper tackles target-centric trajectory prediction under spatial and social context by introducing sosp-CT, a transformer-based model that replaces LSTM-based social tensors and map modules with a Social Tensor Transformer and navigation-area–driven dislocation features. By discretizing dislocations and fusing social occupancy grids at each observation step, the approach yields socially aware and spatially grounded predictions without heavy map processing. Ablations on inland vessel data show modest but consistent gains over context-agnostic and spatial-only variants, with substantial interpretability gains through explicit social interactions. The method offers improved efficiency and robustness to partial observability, with potential applicability to road vehicles and ships beyond inland vessels.

Abstract

Autonomous transportation systems such as road vehicles or vessels must consider the static and dynamic environment to dislocate without collision. Anticipating the behavior of an agent in a given situation is required to react to it adequately and in time. Deep learning-based models have recently become the dominant approach to motion prediction. The social environment is often considered through a CNN-LSTM-based sub-module processing a $\textit{social tensor}$ that contains information on the past trajectories of surrounding agents. For the proposed transformer-based trajectory prediction model, an alternative, computationally more efficient definition and processing of the social tensor is suggested. It considers the interdependencies between the target and surrounding agents at each time step directly instead of relying on the last hidden LSTM states of individually processed agents. A transformer-based sub-module, the Social Tensor Transformer, is integrated into the overall prediction model. It enriches the target agent's dislocation features with social interaction information obtained from the social tensor. To account for spatial limitations, dislocation features are defined in relation to the navigable area. This replaces additional, computationally expensive map processing sub-modules. An ablation study shows that, for longer prediction horizons, the predicted trajectory deviates less from the ground truth than that of a spatially and socially agnostic model. Although the performance gain from a spatial-only to a spatial and social context-sensitive model is small in terms of common error measures, visualizing the results shows that the proposed model is in fact able to predict reactions to surrounding agents and explicitly allows an interpretable behavior.
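To make the idea of a per-time-step social tensor more concrete, the following is a minimal sketch of an occupancy grid built around the target agent for a single observation step. It is not the authors' implementation: the square grid layout, the function name `build_social_grid`, the parameters `visual_range` and `n_cells`, and the choice to store one scalar dislocation change rate per occupied cell are all assumptions made for illustration.

```python
from typing import List, Tuple


def build_social_grid(
    target_xy: Tuple[float, float],
    neighbors: List[Tuple[float, float, float]],
    visual_range: float,
    n_cells: int,
) -> List[List[float]]:
    """Build one occupancy grid for one observation step (hypothetical layout).

    target_xy:    position of the target agent (grid centre).
    neighbors:    (x, y, delta) per surrounding agent, where delta stands in
                  for that agent's dislocation change rate.
    visual_range: half the side length of the square grid, in position units.
    n_cells:      number of cells per grid side.
    """
    grid = [[0.0] * n_cells for _ in range(n_cells)]
    cell_size = 2.0 * visual_range / n_cells
    tx, ty = target_xy

    for x, y, delta in neighbors:
        dx, dy = x - tx, y - ty
        # Agents outside the assumed visual range are not observed.
        if abs(dx) >= visual_range or abs(dy) >= visual_range:
            continue
        # Shift to non-negative coordinates, then bin; clamp guards against
        # floating-point rounding at the upper border.
        col = min(int((dx + visual_range) // cell_size), n_cells - 1)
        row = min(int((dy + visual_range) // cell_size), n_cells - 1)
        # If two agents share a cell, the later one overwrites the earlier one.
        grid[row][col] = delta

    return grid


# One grid per observation step; a sequence of such grids would form the
# social input that is fused with the target's own dislocation features.
grid = build_social_grid(
    target_xy=(0.0, 0.0),
    neighbors=[(10.0, -10.0, 0.5), (500.0, 0.0, 1.0)],
    visual_range=100.0,
    n_cells=4,
)
```

Because each grid is computed directly from the positions at one time step, no surrounding agent has to be passed through its own recurrent encoder first, which is the efficiency argument the abstract makes.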

Paper Structure (10 sections, 2 equations, 6 figures, 1 table)

This paper contains 10 sections, 2 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: The sosp-CT model. Embedding layers are not depicted for better readability. The socially-informed embedding sequence obtained from the Social Tensor Transformer (see Fig. 2) and the concatenation of navigation context and output dislocations embedding are passed to the Transformer to generate the probability distributions over future dislocation labels.
  • Figure 2: Social Tensor Transformer fusing social context and target agent dislocation features at each time step.
  • Figure 3: Schematic example of traffic situations with occupancy grids. Dashed lines enclose the visual range of the target (green), dislocation change rates of surrounding agent $i$ are given by $\Delta_i$, and the target is included in the grid for better understanding.
  • Figure 4: Navigation area-specific (dis)location information. The white area depicts the fairway. Here, $k_i$ is the waterway kilometer distance between the positions $p_{i+1}$ and $p_i$ and $f_i$ the distance from the fairway border.
  • Figure 5: FDE evolution and distribution.
  • ...and 1 more figures
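
The captions of Figures 1 and 4 describe dislocation features defined relative to the navigation area ($k_i$, $f_i$) and a model output given as probability distributions over dislocation labels. The sketch below illustrates one way such features could be computed and mapped to discrete labels. The helper names, the use of signed differences of waterway-kilometer values for $k_i$, the pairing of $f_i$ with position $p_i$, and the bin edges are assumptions for illustration only and are not taken from the paper.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def dislocation_features(
    waterway_km: Sequence[float],
    border_dist: Sequence[float],
) -> List[Tuple[float, float]]:
    """Return (k_i, f_i) pairs for consecutive positions.

    k_i is taken here as waterway_km[i + 1] - waterway_km[i] (assumed signed),
    and f_i as the distance from the fairway border at position p_i.
    """
    return [
        (waterway_km[i + 1] - waterway_km[i], border_dist[i])
        for i in range(len(waterway_km) - 1)
    ]


def discretize(value: float, bin_edges: Sequence[float]) -> int:
    """Map a continuous value to a label index.

    bin_edges must be sorted ascending; the label is the number of edges
    that are less than or equal to the value.
    """
    return bisect_right(bin_edges, value)


# Hypothetical observation of three positions along a waterway.
features = dislocation_features(
    waterway_km=[100.0, 100.5, 101.25],
    border_dist=[20.0, 18.0, 15.0],
)
k_edges = [0.25, 0.5, 1.0]  # hypothetical bin edges for k_i
k_labels = [discretize(k, k_edges) for k, _ in features]
```

Treating the prediction target as a label rather than a continuous value is what allows the model to output a probability distribution per step, as the Figure 1 caption describes; the granularity of the bins then bounds the achievable positional accuracy.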