New Spiking Architecture for Multi-Modal Decision-Making in Autonomous Vehicles

Aref Ghoreishee, Abhishek Mishra, Lifeng Zhou, John Walsh, Nagarajan Kandasamy

TL;DR

This work tackles high-level decision-making for autonomous vehicles by fusing camera BEV, LiDAR, and IMU data through a cross-attention-based module. It introduces a spiking, temporal-aware ternary attention (TTSA) to enable energy-efficient, edge-deployable multi-modal fusion within an end-to-end MM-DQN framework. Compared to uni-modal baselines, MM-DQN improves decision quality, while TTSA narrows the gap between spiking and non-spiking approaches and enhances temporal representation and safety. Experiments on Highway-Env show TTSA achieves competitive rewards with substantially higher spike sparsity, indicating meaningful gains in both performance stability and energy efficiency for real-time autonomous driving tasks.

Abstract

This work proposes an end-to-end multi-modal reinforcement learning framework for high-level decision-making in autonomous vehicles. The framework integrates heterogeneous sensory input, including camera images, LiDAR point clouds, and vehicle heading information, through a cross-attention transformer-based perception module. Although transformers have become the backbone of modern multi-modal architectures, their high computational cost limits their deployment in resource-constrained edge environments. To overcome this challenge, we propose a spiking temporal-aware transformer-like architecture that uses ternary spiking neurons for computationally efficient multi-modal fusion. Comprehensive evaluations across multiple tasks in the Highway Environment demonstrate the effectiveness and efficiency of the proposed approach for real-time autonomous decision-making.
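The ternary spiking neurons mentioned above can be illustrated with a minimal sketch. It assumes a leaky integrate-and-fire update with a symmetric threshold and a hard reset; the function name `ternary_lif` and the `threshold` and `decay` parameters are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def ternary_lif(inputs, threshold=1.0, decay=0.5):
    """Illustrative ternary leaky integrate-and-fire layer (assumed dynamics).

    inputs: array of shape (T, N), input current per time step and neuron.
    Returns an int8 array of shape (T, N) with values in {-1, 0, +1}.
    """
    inputs = np.asarray(inputs, dtype=float)
    membrane = np.zeros(inputs.shape[1])
    spikes = np.zeros(inputs.shape, dtype=np.int8)
    for t in range(inputs.shape[0]):
        # Leak the previous potential, then integrate the new input.
        membrane = decay * membrane + inputs[t]
        pos = membrane >= threshold    # emit +1
        neg = membrane <= -threshold   # emit -1
        spikes[t] = pos.astype(np.int8) - neg.astype(np.int8)
        # Hard reset for neurons that fired (an assumption of this sketch).
        membrane = np.where(pos | neg, 0.0, membrane)
    return spikes

currents = np.array([[0.6, -0.6, 0.1],
                     [0.8, -0.8, 0.1],
                     [0.0,  0.0, 0.1]])
spikes = ternary_lif(currents)
sparsity = float(np.mean(spikes == 0))  # fraction of silent (zero) outputs
```

In this sketch, sparsity is simply the fraction of zero outputs; a higher value means fewer spike events to process downstream.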

Paper Structure

This paper contains 34 sections, 2 theorems, 33 equations, 7 figures, 1 table.

Key Result

Proposition 1

The representational capacity $C(X)$ achieves its maximum when $X$ follows a uniform distribution, i.e., $p_X(x) = 1/N$, where $N$ is the number of samples. Under this condition,

Figures (7)

  • Figure 1: The MM-DQN architecture, which uses a cross-attention module to fuse BEV images and LiDAR data for RL-based decision-making. The figure also shows the choice between ReLU and spiking activations: the traditional ANN uses ReLU activations throughout, whereas the SNN replaces all ReLU activations with binary spiking neurons.
  • Figure 2: Transformer-like cross-attention mechanism.
  • Figure 3: The standard spiking attention mechanism compared with our proposed temporal-aware approach.
  • Figure 4: The average reward obtained by MM-DQN compared to uni-modal DQNs operating with a stack of four frames and a single frame.
  • Figure 5: Average reward, crash frequency, and speed obtained by the non-spiking, spiking with SSA, and spiking with TTSA architectures for the Highway scenario.
  • ...and 2 more figures
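
The cross-attention fusion named in Figures 1 and 2 can be sketched as ordinary scaled dot-product attention between two token sets. This is a generic, non-spiking illustration rather than the paper's module: the choice of BEV tokens as queries and LiDAR tokens as keys and values, the token counts, and all dimensions are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention."""
    q = query_tokens @ w_q      # (n_query, d_head)
    k = context_tokens @ w_k    # (n_context, d_head)
    v = context_tokens @ w_v    # (n_context, d_head)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model, d_head = 16, 8
bev_tokens = rng.normal(size=(6, d_model))     # hypothetical BEV patch embeddings
lidar_tokens = rng.normal(size=(10, d_model))  # hypothetical LiDAR feature embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

fused, attn = cross_attention(bev_tokens, lidar_tokens, w_q, w_k, w_v)
```

Each row of `fused` is a LiDAR-informed summary for one BEV token; a downstream Q-network could consume these fused features.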

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2
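
Proposition 1 has the same form as the standard maximum-entropy property of discrete distributions. The definition of $C(X)$ is not shown on this page, so the sketch below assumes, purely for illustration, that capacity is measured by Shannon entropy. Under that assumption it compares uniform and skewed distributions over binary and ternary symbols; the function name and the example probabilities are hypothetical.

```python
import numpy as np

def shannon_entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log2(p)).sum())

uniform_binary = [0.5, 0.5]            # symbols {0, 1}
uniform_ternary = [1/3, 1/3, 1/3]      # symbols {-1, 0, +1}
skewed_ternary = [0.05, 0.90, 0.05]    # mostly silent, hypothetical values

h_binary = shannon_entropy_bits(uniform_binary)
h_ternary = shannon_entropy_bits(uniform_ternary)
h_skewed = shannon_entropy_bits(skewed_ternary)
```

Under this entropy assumption, the uniform ternary case reaches $\log_2 3$ bits per symbol, above the 1 bit of the uniform binary case, while the skewed ternary distribution falls below both.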