Temporal Misalignment Attacks against Multimodal Perception in Autonomous Driving

Md Hasan Shahriar, Md Mohaimin Al Barat, Harshavardhan Sundar, Ning Zhang, Naren Ramakrishnan, Y. Thomas Hou, Wenjing Lou

TL;DR

This work addresses the vulnerability of multimodal fusion (MMF) in autonomous driving to temporal misalignment by introducing DejaVu, a timing attack that injects malicious delays into sensor streams through the in-vehicle network. The authors develop a system model, a threat model, and two delay strategies (constant and random), and evaluate their impact on 3D object detection and multi-object tracking (MOT) using MVXNet, BEVFusion, and MMF-JDT on the KITTI and nuScenes datasets. Key findings show that a single-frame LiDAR delay can reduce 3D detection mAP by up to 88.5%, while a three-frame camera delay can reduce multi-object tracking accuracy (MOTA) by 73%: LiDAR delays are particularly dominant for detection, whereas camera delays chiefly affect tracking. These results are validated with a hardware-in-the-loop testbed and end-to-end Autoware simulations, in which delayed data can cause collisions or phantom braking. The paper highlights the need for synchronization-aware MMF design and for defense strategies, including hardware-anchored timestamps, multi-source time synchronization, temporal-consistency checks, and delay-aware fusion, to maintain safety in autonomous systems.

Abstract

Multimodal fusion (MMF), which primarily fuses camera and LiDAR streams for comprehensive and efficient scene understanding, plays a critical role in autonomous driving perception. However, its strict reliance on precise temporal synchronization exposes it to new vulnerabilities. In this paper, we introduce DejaVu, an attack that exploits the in-vehicle network and induces delays across sensor streams to create subtle temporal misalignments, severely degrading downstream MMF-based perception tasks. Our comprehensive attack analysis across different models and datasets reveals the sensors' task-specific imbalanced sensitivities: object detection is overly dependent on LiDAR inputs, while object tracking is highly reliant on camera inputs. Consequently, with a single-frame LiDAR delay, an attacker can reduce the car detection mAP by up to 88.5%, while with a three-frame camera delay, multiple object tracking accuracy (MOTA) for the car class drops by 73%. We further demonstrate two attack scenarios, using an automotive Ethernet testbed for hardware-in-the-loop validation and the Autoware stack for end-to-end autonomous driving (AD) simulation, which show the feasibility of the DejaVu attack and its severe impact, such as collisions and phantom braking.
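Because the attack works by breaking the temporal pairing between modalities, one simple way to picture a temporal-consistency check is to compare the capture timestamps of each camera/LiDAR pair before fusion. The sketch below is illustrative only and is not a defense evaluated in the paper: the function name and the 50 ms tolerance are hypothetical, and it assumes the timestamps themselves are trustworthy (for example, anchored in sensor hardware).

```python
from typing import List, Sequence


def flag_misaligned(camera_ts: Sequence[float],
                    lidar_ts: Sequence[float],
                    tolerance_s: float = 0.05) -> List[int]:
    """Return indices of camera/LiDAR pairs whose capture times differ too much.

    camera_ts and lidar_ts hold the capture timestamps (in seconds) of the
    frames that were paired for fusion; tolerance_s is an assumed threshold.
    """
    if len(camera_ts) != len(lidar_ts):
        raise ValueError("timestamp sequences must have the same length")
    return [
        i for i, (cam, lidar) in enumerate(zip(camera_ts, lidar_ts))
        if abs(cam - lidar) > tolerance_s
    ]


# Sensors at 10 Hz; from step 2 onward the LiDAR frame is one frame stale.
camera_ts = [0.0, 0.1, 0.2, 0.3]
lidar_ts = [0.0, 0.1, 0.1, 0.2]
print(flag_misaligned(camera_ts, lidar_ts))  # [2, 3]
```

A check like this only detects misalignment; what to do with a flagged pair (drop it, fall back to a single modality, or compensate for the delay) is a separate design decision.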

Paper Structure

This paper contains 29 sections, 6 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Impact of the DejaVu attack on 3D object detection. (a-b) show two benign scenarios without any temporal misalignment and, hence, with accurate object detection. (c-d) illustrate different temporal alignments between camera and LiDAR inputs and highlight how delayed sensor data can lead to incorrect detections, either by detecting non-existent objects (false positives) or by missing present ones (false negatives). In (c), the MMF prioritizes the (delayed) LiDAR data and predicts three objects, including a pedestrian who is not present in the current camera view; the result appears accurate from LiDAR's perspective but is a false detection from the camera's perspective. In (d), the MMF still prioritizes the (updated) LiDAR data and predicts two objects, excluding the pedestrian who is present in the current camera view, which is a missed detection from the camera's perspective. In both temporal misalignment attack cases, the MMF biases its fusion toward LiDAR and fails to account for semantic discrepancies in the camera modality.
  • Figure 2: Overview of the proposed system modeling with DejaVu attack.
  • Figure 3: Uni-DejaVu attack impacts on 3D object detection performance of MVXNet on KITTI dataset for different object classes.
  • Figure 4: Mul-DejaVu attack impacts on 3D object detection performance of MVXNet on KITTI dataset for different object classes.
  • Figure 5: Uni-DejaVu attack impacts on 3D object detection performance of BEVFusion on nuScenes dataset for different object classes.
  • ...and 8 more figures

Theorems & Definitions (1)

  • Definition 1: $x_i^{(t_{\text{act}})},t_{\text{pre}}$