CaTFormer: Causal Temporal Transformer with Dynamic Contextual Fusion for Driving Intention Prediction

Sirui Wang, Zhou Guan, Bingxi Zhao, Tongjia Gu, Jie Liu

TL;DR

CaTFormer tackles driving intention prediction by explicitly modeling causal interactions between driver state and environmental context. Within a Transformer framework, it introduces three core modules: Reciprocal Delayed Fusion for temporal cross-stream alignment, Counterfactual Residual Encoding to disentangle genuine causal effects, and a Feature Synthesis Network for adaptive fusion. On Brain4Cars, CaTFormer attains state-of-the-art results, with strong performance across highway and urban maneuvers, and provides attention and saliency visualizations that expose its causal reasoning. The approach targets real-time human-machine co-driving, where it aims to improve safety, reliability, and transparency in maneuver anticipation.

Abstract

Accurate prediction of driving intention is key to enhancing the safety and interactive efficiency of human-machine co-driving systems, and it serves as a cornerstone for achieving high-level autonomous driving. However, current approaches remain inadequate for accurately modeling the complex spatiotemporal interdependencies and the unpredictable variability of human driving behavior. To address these challenges, we propose CaTFormer, a Causal Temporal Transformer that explicitly models causal interactions between driver behavior and environmental context for robust intention prediction. Specifically, CaTFormer introduces a novel Reciprocal Delayed Fusion (RDF) mechanism for precise temporal alignment of interior and exterior feature streams, a Counterfactual Residual Encoding (CRE) module that systematically eliminates spurious correlations to reveal authentic causal dependencies, and an innovative Feature Synthesis Network (FSN) that adaptively synthesizes these purified representations into coherent temporal representations. Experimental results demonstrate that CaTFormer attains state-of-the-art performance on the Brain4Cars dataset. It effectively captures complex causal temporal dependencies and enhances both the accuracy and transparency of driving intention prediction.

Paper Structure

This paper contains 24 sections, 13 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Comparison between previous driving intention prediction methods and ours. (a) is an LSTM-GRU framework processing interior and exterior streams independently before concatenation. (b) is our CaTFormer, a Transformer-based model enabling dynamic causal fusion of dual streams with integrated intention priors. Through joint modeling of global-local dependencies and cross-stream interactions, our approach outperforms existing methods.
  • Figure 2: Overview of the CaTFormer pipeline. After data preprocessing, exterior optical flow is encoded by ResNet-18 and interior images by MobileFaceNet to produce dual-stream feature sequences. These are then fed into three core modules: (1) Reciprocal Delayed Fusion (RDF) for temporal feature integration; (2) Counterfactual Residual Encoding (CRE) for causal enhancement and intention embedding; and (3) Feature Synthesis Network (FSN) for dynamic fusion of complementary interior, exterior and interaction views to yield the final driving intention prediction.
  • Figure 3: Illustration of Bidirectional Dependency Attention (BDA), where buffered interior and exterior features cross-attend to enhance current-frame representations.
  • Figure 4: Confusion matrices on the Brain4Cars dataset. Left: ours. Right: TIFN (TIFN2023). Color deepens as the value increases.
  • Figure 5: Temporal Attention and Decision-Margin Saliency Visualizations. (a) and (b) display the Observed Dependency Attention, where each pixel $w_{ij}$ (row $i$, column $j$) represents the attention weight from the $j$-th query to the $i$-th key for the out$\to$in and in$\to$out directions, respectively. (c) and (d) show the Direct Causal Attention, highlighting frames with significant causal influence. The first column is masked in gray to mark shift padding. (e) and (f) overlay the Decision-Margin Saliency Map on the final interior and exterior frames. Each pixel's intensity is defined by $\sum\limits_{t\in\mathcal{T}}\alpha_t\Bigl\lvert\partial\bigl(z_{c^*}- \tfrac{1}{\mathcal{G}-1}\sum\limits_{c\neq c^*} z_c\bigr)/\partial x_t\Bigr\rvert$, where $\alpha_t$ denotes the causal-attention weight for frame $t$, $x_t$ denotes the input feature vector at that pixel, $z_c$ is the final logit for class $c$, $c^*$ is the predicted class, and $\mathcal{G}$ is the number of classes (implied by the $\tfrac{1}{\mathcal{G}-1}$ average over $c\neq c^*$). The map highlights the regions most responsible for the model's final decision. Yellow indicates stronger attention.
  • ...and 2 more figures
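
The decision-margin saliency defined in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a hypothetical toy linear classifier (`toy_model`, `TOY_WEIGHTS`), treats each frame as a short list of scalar positions, and approximates the derivative with central finite differences, where a real model would more likely use automatic differentiation. The names `decision_margin` and `margin_saliency` are illustrative only.

```python
from typing import Callable, List, Sequence

Frames = List[List[float]]  # frames[t][p]: input value at position p of frame t


def decision_margin(logits: Sequence[float], c_star: int) -> float:
    """z_{c*} minus the mean of the remaining G-1 logits."""
    others = [z for c, z in enumerate(logits) if c != c_star]
    return logits[c_star] - sum(others) / len(others)


def margin_saliency(
    model: Callable[[Frames], List[float]],
    frames: Frames,
    alphas: Sequence[float],
    eps: float = 1e-4,
) -> List[float]:
    """Per-position saliency: sum_t alpha_t * |d margin / d x_t|.

    The derivative is approximated by central finite differences
    (an assumption of this sketch, not something the caption specifies).
    """
    logits = model(frames)
    # c* is the predicted class; it stays fixed while inputs are perturbed.
    c_star = max(range(len(logits)), key=lambda c: logits[c])
    work = [list(frame) for frame in frames]  # copy so the caller's data is untouched
    saliency = [0.0] * len(work[0])
    for t, alpha in enumerate(alphas):
        for p in range(len(work[t])):
            original = work[t][p]
            work[t][p] = original + eps
            up = decision_margin(model(work), c_star)
            work[t][p] = original - eps
            down = decision_margin(model(work), c_star)
            work[t][p] = original
            saliency[p] += alpha * abs((up - down) / (2 * eps))
    return saliency


# Hypothetical toy classifier: 3 classes, 2 positions per frame,
# each logit is a weighted sum of the inputs over all frames.
TOY_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]


def toy_model(frames: Frames) -> List[float]:
    return [
        sum(w[p] * frame[p] for frame in frames for p in range(len(w)))
        for w in TOY_WEIGHTS
    ]


frames = [[1.0, 0.2], [2.0, 0.1]]  # two frames, two positions each
alphas = [0.25, 0.75]              # stand-in causal-attention weights
saliency = margin_saliency(toy_model, frames, alphas)
```

In this toy setup class 0 is predicted, and its margin depends only on position 0, so `saliency` comes out close to `[1.0, 0.0]`. Averaging the competing logits rather than taking the runner-up alone means the map reflects evidence against all other classes at once.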