InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving
Ruiqi Song, Xianda Guo, Yanlun Peng, Qinggong Wei, Hangbin Wu, Long Chen
TL;DR
InsightDrive addresses the limitations of global, explicit scene representations in end-to-end autonomous driving by combining an attention-centric explicit scene representation with a reasoning-centric implicit representation derived from Chain-of-Thought (CoT) instructions. It introduces a lightweight task-level Mixture-of-Experts (MoE) adapter that injects cognitive priors into perception and planning, and a diffusion-based planner conditioned on both representations. A human–LLM–vehicle knowledge-distillation pipeline transfers driving cognition from CoT prompts to onboard models. Experiments on nuScenes and NAVSIM report state-of-the-art safety and planning robustness at a small parameter overhead. By planning from attended regions together with implicit reasoning, the approach targets safer end-to-end driving in complex traffic scenarios.
Abstract
Conventional end-to-end autonomous driving methods often rely on explicit global scene representations, which typically consist of 3D object detection, online mapping, and motion prediction. In contrast, human drivers selectively attend to task-relevant regions and implicitly reason over the broader traffic context. Motivated by this observation, we introduce a lightweight end-to-end autonomous driving framework, InsightDrive. Unlike approaches that directly embed large language models (LLMs), InsightDrive introduces an Insight scene representation that jointly models an attention-centric explicit scene representation and a reasoning-centric implicit scene representation, so that scene understanding aligns more closely with human cognitive patterns for trajectory planning. To this end, we employ Chain-of-Thought (CoT) instructions to model human driving cognition and design a task-level Mixture-of-Experts (MoE) adapter that injects this knowledge into the autonomous driving model at negligible parameter cost. We further condition the planner on both explicit and implicit scene representations and employ a diffusion-based generative policy, which produces robust trajectory predictions and decisions. The overall framework establishes a knowledge distillation pipeline that transfers human driving knowledge to LLMs and subsequently to onboard models. Extensive experiments on the nuScenes and NAVSIM benchmarks demonstrate that InsightDrive achieves significant improvements over conventional scene representation approaches.
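The abstract does not specify how the task-level MoE adapter is built, so the following is only a minimal illustrative sketch of one plausible reading: a handful of low-rank residual experts whose mixing weights are chosen per task (for example perception or planning) rather than per token. Every name here (`TaskLevelMoEAdapter`, `rank`, `gate_logits`, the task labels) and the low-rank design itself are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class TaskLevelMoEAdapter:
    """Hypothetical adapter: low-rank experts mixed by one gate vector per task."""

    def __init__(self, dim, rank, num_experts, tasks, seed=0):
        rng = random.Random(seed)
        self.dim, self.rank, self.num_experts = dim, rank, num_experts
        # Down-projections (rank x dim) start small and random.
        self.down = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
            for _ in range(num_experts)
        ]
        # Up-projections (dim x rank) start at zero, so the adapter is
        # initially an identity mapping and cannot disturb the base model.
        self.up = [
            [[0.0] * rank for _ in range(dim)] for _ in range(num_experts)
        ]
        # One learnable logit vector per task: routing is per task, not per token.
        self.gate_logits = {t: [0.0] * num_experts for t in tasks}

    def gate(self, task):
        """Mixing weights over experts for the given task."""
        return softmax(self.gate_logits[task])

    def forward(self, x, task):
        """Return x plus the gate-weighted sum of low-rank expert updates."""
        out = list(x)
        for w, down, up in zip(self.gate(task), self.down, self.up):
            delta = matvec(up, matvec(down, x))
            out = [o + w * d for o, d in zip(out, delta)]
        return out

    def num_parameters(self):
        """Count adapter parameters: expert projections plus per-task gates."""
        experts = self.num_experts * 2 * self.dim * self.rank
        gates = len(self.gate_logits) * self.num_experts
        return experts + gates


adapter = TaskLevelMoEAdapter(
    dim=8, rank=2, num_experts=3, tasks=("perception", "planning")
)
x = [0.1 * i for i in range(8)]
y = adapter.forward(x, "planning")
```

Under these assumptions, the adapter adds `num_experts * 2 * dim * rank` projection weights plus one gate vector per task, which stays small relative to a full feature layer when `rank` is much smaller than `dim`. That is one way a "negligible parameter cost" could be achieved.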
