SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models
Jiawei Zhang, Xuan Yang, Taiqi Wang, Yu Yao, Aleksandr Petiushko, Bo Li
TL;DR
SafeAuto addresses safety gaps in autonomous driving built on multimodal large language models (MLLMs) by combining three components: a Position-Dependent Cross-Entropy (PDCE) loss for more accurate low-level control, a Markov Logic Network (MLN) safety verifier that enforces traffic rules, and a Multimodal Retrieval-Augmented Generation (RAG) module that learns from past driving experiences. The approach improves both low-level predictions (speed, steering) and high-level behaviors (action, justification) on the BDD-X and DriveLM datasets, and its explicit safety checks can override unsafe MLLM outputs. By grounding decisions in structured knowledge and rich multimodal context, SafeAuto aims for more reliable and safer behavior in real-world autonomous driving systems. The work provides a plug-in framework, with public code, for integrating safety reasoning into MLLM-driven autonomous driving pipelines.
Abstract
Traditional autonomous driving systems often struggle to connect high-level reasoning with low-level control, leading to suboptimal and sometimes unsafe behaviors. Recent advances in multimodal large language models (MLLMs), which process both visual and textual data, offer an opportunity to unify perception and reasoning. However, effectively embedding precise safety knowledge into MLLMs for autonomous driving remains a significant challenge. To address this, we propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. First, we introduce a Position-Dependent Cross-Entropy (PDCE) loss to improve low-level control signal predictions when values are represented as text. Second, to explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic (e.g., "red light $\implies$ stop") and embeds them into a probabilistic graphical model (e.g., a Markov Logic Network, MLN) to verify predicted actions using recognized environmental attributes. Third, our Multimodal Retrieval-Augmented Generation (RAG) model leverages video, control signals, and environmental attributes to learn from past driving experiences. Integrating PDCE, MLN, and Multimodal RAG, SafeAuto outperforms existing baselines across multiple datasets, enabling more accurate, reliable, and safer autonomous driving. The code is available at https://github.com/AI-secure/SafeAuto.
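
As a rough illustration of the rule-based verification idea described above, the following minimal Python sketch scores candidate actions under two weighted implication rules in the style of a Markov Logic Network, then overrides a predicted action only when the rules make another action strictly more probable. The action set, attribute names, rule weights, and the `verify` helper are hypothetical and chosen for illustration; they are not taken from the SafeAuto code, and in practice the weights would typically be learned from data.

```python
import math

# Hypothetical action set and rule weights, for illustration only.
ACTIONS = ["stop", "accelerate", "turn_left"]


def implies(premise, conclusion):
    """Truth value of the logical implication premise => conclusion."""
    return (not premise) or conclusion


# Each rule is (weight, formula). A formula maps (attributes, action) to a bool.
RULES = [
    # "red light => stop"
    (4.0, lambda attrs, action: implies(attrs["red_light"], action == "stop")),
    # "pedestrian ahead => do not accelerate" (assumed extra rule)
    (2.0, lambda attrs, action: implies(attrs["pedestrian_ahead"], action != "accelerate")),
]


def action_posterior(attrs):
    """P(action | attrs) proportional to exp(sum of weights of satisfied rules)."""
    scores = {
        action: sum(weight for weight, formula in RULES if formula(attrs, action))
        for action in ACTIONS
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    unnormalized = {a: math.exp(s - max_score) for a, s in scores.items()}
    total = sum(unnormalized.values())
    return {a: u / total for a, u in unnormalized.items()}


def verify(predicted_action, attrs):
    """Keep the predicted action unless the rules strictly prefer another one."""
    posterior = action_posterior(attrs)
    best_action = max(posterior, key=posterior.get)
    if posterior[predicted_action] + 1e-9 >= posterior[best_action]:
        return predicted_action
    return best_action
```

Under these assumed rules, `verify("accelerate", {"red_light": True, "pedestrian_ahead": False})` returns `"stop"`, while the same prediction with no red light is left unchanged because no rule is violated. Enumerating every action is only feasible for a small, discrete action set like this one.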
