Pedestrian Crossing Intent Prediction via Psychological Features and Transformer Fusion

Sima Ashayer, Hoang H. Nguyen, Yu Liang, Mina Sartipi

Abstract

Accurate pedestrian intention prediction is essential for autonomous vehicles to navigate safely in urban environments. We present a lightweight, socially informed architecture for this task. It fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling. To quantify uncertainty, we incorporate two complementary heads: a variational bottleneck whose KL divergence captures epistemic uncertainty, and a Mahalanobis distance detector that identifies distributional shift. Together, these components yield calibrated probabilities and actionable risk scores without compromising efficiency. On the PSI 1.0 benchmark, our model outperforms recent vision-language models, achieving 0.9 F1, 0.94 AUC-ROC, and 0.78 MCC while using only structured, interpretable features. On the more diverse PSI 2.0 dataset, where, to the best of our knowledge, no prior results exist, we establish a strong initial baseline of 0.78 F1 and 0.79 AUC-ROC. Selective prediction based on Mahalanobis scores increases test accuracy by up to 0.4 percentage points at 80% coverage. Qualitative attention heatmaps further show how the model shifts its cross-stream focus under ambiguity. The proposed approach is modality-agnostic, easy to integrate with vision-language pipelines, and suitable for risk-aware intent prediction on resource-constrained platforms.

Paper Structure (30 sections, 25 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: Overview of the socially informed, multi-stream architecture for pedestrian crossing intention prediction. Four input streams (attention, positional, situational, and interaction features) are encoded by highway encoders and augmented with stream embeddings. A Transformer that models temporal and cross-stream dependencies, followed by self-attention pooling, yields a unified representation. A residual MLP predicts crossing probability, while KL and Mahalanobis modules provide uncertainty and anomaly scores.
  • Figure 2: Cross-stream attention heatmaps and uncertainty scores for two PSI 2.0 samples. Each row shows average multi-head attention from one stream token (attn/pos/sit/inter) to all streams. Left: low model risk (KL = 0.00, Mah = 10.71). Right: high model risk (KL = 0.04, Mah = 2329.78). Cell values are mean attention weights across the four behavioral streams.
  • Figure 3: Four composite qualitative examples showing input frames with a visualization of the behavioral-stream attention weights and uncertainty estimates (KL and Mahalanobis).
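
The KL and Mahalanobis scores reported in the captions above, and the selective prediction at 80% coverage mentioned in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the variational posterior is assumed to be a diagonal Gaussian compared against a standard normal prior, and the Mahalanobis distance is assumed to use a diagonal covariance for simplicity (the paper may use a full covariance or report a squared distance).

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a single sample.

    Assumption: diagonal Gaussian posterior and standard normal prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance of x from training statistics (mean, var).

    Assumption: diagonal covariance, so each feature is scaled independently.
    """
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))

def selective_accuracy(risk_scores, correct, coverage=0.8):
    """Accuracy on the `coverage` fraction of samples with the lowest risk.

    risk_scores: one risk value per sample (e.g. a Mahalanobis distance).
    correct:     1 if the prediction for that sample was right, else 0.
    """
    n_keep = max(1, round(coverage * len(risk_scores)))
    # Sort sample indices from lowest to highest risk and keep the first n_keep.
    order = sorted(range(len(risk_scores)), key=lambda i: risk_scores[i])
    kept = order[:n_keep]
    return sum(correct[i] for i in kept) / n_keep
```

With this kind of scheme, samples whose risk score is far above the rest (such as the right-hand example in Figure 2) would be the first to be withheld from automatic prediction as coverage is reduced.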