Temporal Misalignment Attacks against Multimodal Perception in Autonomous Driving
Md Hasan Shahriar, Md Mohaimin Al Barat, Harshavardhan Sundar, Ning Zhang, Naren Ramakrishnan, Y. Thomas Hou, Wenjing Lou
TL;DR
This work addresses the vulnerability of multimodal fusion (MMF) in autonomous driving to temporal misalignment by introducing DejaVu, a timing attack that injects malicious delays into sensor streams over the in-vehicle network. The authors develop a system model, a threat model, and two delay strategies (constant and random), and evaluate their impact on 3D object detection and multi-object tracking (MOT) using MVXNet, BEVFusion, and MMF-JDT on the KITTI and NuScenes datasets. A single-frame LiDAR delay can reduce 3D detection mAP by up to 88.5%, while a three-frame camera delay can reduce MOTA by 73%; LiDAR delays dominate the degradation in detection, whereas camera delays dominate the degradation in tracking. These results are validated in hardware-in-the-loop experiments and end-to-end Autoware simulations, where delayed data can cause collisions or phantom braking. The paper highlights the need for synchronization-aware MMF design and for defense strategies, including hardware-anchored timestamps, multi-source time synchronization, temporal-consistency checks, and delay-aware fusion, to maintain safety in autonomous systems.
Abstract
Multimodal fusion (MMF), which primarily fuses camera and LiDAR streams for comprehensive and efficient scene understanding, plays a critical role in autonomous driving perception. However, its strict reliance on precise temporal synchronization exposes it to new vulnerabilities. In this paper, we introduce DejaVu, an attack that exploits the in-vehicle network to induce delays across sensor streams, creating subtle temporal misalignments that severely degrade downstream MMF-based perception tasks. Our comprehensive attack analysis across different models and datasets reveals that sensor sensitivity is imbalanced and task-specific: object detection depends heavily on LiDAR inputs, while object tracking relies heavily on camera inputs. Consequently, with a single-frame LiDAR delay, an attacker can reduce the car detection mAP by up to 88.5%, while with a three-frame camera delay, the multiple object tracking accuracy (MOTA) for cars drops by 73%. We further implement two attack scenarios, using an automotive Ethernet testbed for hardware-in-the-loop validation and the Autoware stack for end-to-end AD simulation, which demonstrate the feasibility of the DejaVu attack and its severe impact, such as collisions and phantom braking.
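As an illustration only, the effect of a delay on cross-sensor alignment can be sketched at the level of frame indices. The sketch below is not the paper's implementation: the function names, the frame-granular delay model, the `max_delay` bound for the random strategy, and the clamping at the first frame are all assumptions made for this example.

```python
import random


def delayed_indices(num_frames, strategy="constant", delay=1, max_delay=3, seed=0):
    """Hypothetical model: index of the frame the fusion module receives
    from a delayed sensor at each fusion step t."""
    rng = random.Random(seed)
    indices = []
    for t in range(num_frames):
        if strategy == "constant":
            d = delay                      # same delay at every step
        else:
            d = rng.randint(0, max_delay)  # assumed: uniform delay in [0, max_delay]
        indices.append(max(0, t - d))      # cannot receive a frame before the first one
    return indices


def misalignment(sensor_a, sensor_b):
    """Per-step gap, in frames, between the two streams being fused."""
    return [abs(a - b) for a, b in zip(sensor_a, sensor_b)]


num_frames = 5
camera = list(range(num_frames))                             # camera arrives on time
lidar = delayed_indices(num_frames, "constant", delay=1)     # single-frame LiDAR delay
gaps = misalignment(lidar, camera)
print(lidar, gaps)
```

With a constant single-frame delay, every fusion step after the first pairs the current camera frame with the previous LiDAR frame; under the random strategy the gap varies from step to step.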
