Trustworthy Pedestrian Trajectory Prediction via Pattern-Aware Interaction Modeling

Kaiyuan Zhai, Juan Chen, Chao Wang, Zeyi Xu, Guoming Tang

TL;DR

This work tackles trustworthy pedestrian trajectory prediction by addressing the interpretability gap in prior black-box interaction models. It introduces InSyn, a Transformer-based framework with a Pattern-Aware Interaction Encoder, a Trajectory Generator that employs Seq-Start of Seq (SSOS), and a Seq-CVAE for goal sampling, enabling explicit recognition of interaction patterns such as In Sync and Conflict. The approach yields state-of-the-art average ADE on ETH/UCY, demonstrates clear interpretability through case studies, and shows that SSOS reduces initial-step errors by around 6.58%, enhancing stability in sequential predictions. These contributions offer practical benefits for safety-critical applications like autonomous driving by providing both accuracy and transparency in socially aware trajectory forecasting.

Abstract

Accurate and reliable pedestrian trajectory prediction is critical for intelligent applications, yet achieving trustworthy prediction remains highly challenging due to the complexity of interactions among pedestrians. Previous methods often adopt black-box modeling of pedestrian interactions. Despite their strong performance, such opaque modeling limits the reliability of predictions in real-world deployments. To address this issue, we propose InSyn (Interaction-Synchronization Network), a novel Transformer-based model that explicitly captures diverse interaction patterns (e.g., walking in sync or conflicting) while effectively modeling direction-sensitive social behaviors. Additionally, we introduce a training strategy, termed Seq-Start of Seq (SSOS), designed to alleviate the common issue of initial-step divergence in numerical time-series prediction. Experiments on the ETH and UCY datasets demonstrate that our model not only outperforms recent black-box baselines in prediction accuracy, especially under high-density scenarios, but also provides transparent interaction modeling, as shown in the case study. Furthermore, the SSOS strategy proves effective in improving sequential prediction performance, reducing the initial-step prediction error by approximately 6.58%. Code is available at https://github.com/rickzky1001/InSyn

Paper Structure

This paper contains 23 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of Interaction Modeling: Previous Methods vs. Our Approach. F represents the interaction effect between pedestrians. In traditional approaches (top), all neighbors of the agent are treated as being in the same state. Our method (bottom) introduces a more refined modeling strategy by considering the specific states of neighboring pedestrians, providing a more nuanced understanding of the neighboring interaction.
  • Figure 2: Left: illustration of the input $S_{0:\tau}$. At each time step, the walking state $S_k$ comprises the 2D coordinates $(x, y)$ and interaction information $N_k$. Right: the In Sync, Conflict, and No Neighbor states. Note that the 4-region partition and the scenario identification use simple spatial-temporal rules for transparency; more complex methods could be applied here, but they are beyond our scope.
  • Figure 3: Overview of the InSyn framework for trajectory prediction. Our model consists of three key modules: (1) Interaction Encoder, (2) Trajectory Generator, and (3) Seq-CVAE. The input observed walking state includes the agent's trajectory positions $pos_{0:\tau}$ and its interaction information $N_{0:\tau}$ within the observed time $0:\tau$.
  • Figure 4: Seq-CVAE Architecture. Flatten represents flattening the input to a one-dimensional vector; MLP refers to the multi-layer perceptron, and $[a,b,c]$ above it indicates the dimensional transformations across its layers; $\copyright$ represents concatenation; $\mu$ and $\sigma$ represent the mean and standard deviation of the latent variable $z$. During training, the reparameterization trick (Kingma and Welling, 2013) is employed to enable backpropagation.
  • Figure 5: Case Study of Region Partition and Interaction State. This figure compares three variants of InSyn: (1) without region partition (w/o-RP), (2) without interaction state (w/o-IS), and (3) the full model (InSyn). To isolate the effect of interaction modeling, we exclude the Goal Sampler in this evaluation.
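
The reparameterization trick mentioned in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it uses NumPy rather than a deep-learning framework, and the function name `reparameterize`, the two-dimensional latent size, and the example values of `mu` and `sigma` are assumptions chosen only for illustration. The idea is to draw the latent variable as $z = \mu + \sigma \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, so that the randomness sits in $\epsilon$ and $z$ remains a differentiable function of $\mu$ and $\sigma$.

```python
import numpy as np


def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps, where eps ~ N(0, I).

    The noise eps is drawn independently of mu and sigma, so in an
    autodiff framework gradients could flow through mu and sigma.
    """
    eps = rng.standard_normal(mu.shape)  # standard normal noise
    return mu + sigma * eps


# Hypothetical encoder outputs for a 2-dimensional latent variable.
rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])     # assumed mean of z
sigma = np.array([0.1, 2.0])   # assumed standard deviation of z

z = reparameterize(mu, sigma, rng)
```

Repeated calls with the same `mu` and `sigma` yield different samples of `z`, which is how a CVAE-style sampler can propose multiple candidate goals from one observed input.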