InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving

Ruiqi Song, Xianda Guo, Yanlun Peng, Qinggong Wei, Hangbin Wu, Long Chen

TL;DR

InsightDrive tackles limitations of global, explicit scene representations in end-to-end autonomous driving by combining an attention-centric explicit scene representation with an implicit, Chain-of-Thought–driven reasoning stream. It introduces a lightweight task-level MoE adapter to inject cognitive priors into perception and planning, and a diffusion-based planner conditioned on both representations. A human–LLM–vehicle knowledge-distillation pipeline transfers driving cognition from CoT prompts to onboard models. Experiments on nuScenes and NAVSIM show state-of-the-art safety and planning robustness with little parameter overhead. By planning from attended regions and implicit reasoning rather than from a full global scene, the approach targets safer end-to-end driving in complex traffic scenarios.

Abstract

Conventional end-to-end autonomous driving methods often rely on explicit global scene representations, which typically consist of 3D object detection, online mapping, and motion prediction. In contrast, human drivers selectively attend to task-relevant regions and implicitly reason over the broader traffic context. Motivated by this observation, we introduce a lightweight end-to-end autonomous driving framework, InsightDrive. Unlike approaches that directly embed large language models (LLMs), InsightDrive introduces an Insight scene representation that jointly models an attention-centric explicit scene representation and a reasoning-centric implicit scene representation, so that scene understanding aligns more closely with human cognitive patterns for trajectory planning. To this end, we employ Chain-of-Thought (CoT) instructions to model human driving cognition and design a task-level Mixture-of-Experts (MoE) adapter that injects this knowledge into the autonomous driving model at negligible parameter cost. We further condition the planner on both explicit and implicit scene representations and employ a diffusion-based generative policy, which produces robust trajectory predictions and decisions. The overall framework establishes a knowledge distillation pipeline that transfers human driving knowledge to LLMs and subsequently to onboard models. Extensive experiments on the nuScenes and NAVSIM benchmarks demonstrate that InsightDrive achieves significant improvements over conventional scene representation approaches.
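
The abstract does not describe the internals of the task-level MoE adapter, so the following is only an illustrative sketch of how such an adapter could keep its parameter cost small. It assumes one low-rank residual expert per task and hard routing by task identity; the class name `TaskLevelMoEAdapter`, the task names, and the sizes `dim` and `rank` are all hypothetical and are not taken from the paper.

```python
import numpy as np


class TaskLevelMoEAdapter:
    """Hypothetical task-level MoE adapter: one low-rank expert per task."""

    def __init__(self, dim, rank, tasks, seed=0):
        rng = np.random.default_rng(seed)
        # Each expert is a low-rank pair: down (dim x rank) and up (rank x dim).
        self.down = {t: rng.normal(0.0, 0.02, (dim, rank)) for t in tasks}
        # Zero-initialised up-projection, so the adapter starts as an identity map.
        self.up = {t: np.zeros((rank, dim)) for t in tasks}

    def __call__(self, x, task):
        # Assumed routing: the expert is chosen by task identity,
        # not by a learned per-token gate.
        return x + (x @ self.down[task]) @ self.up[task]

    def num_params(self):
        # Total parameters across all experts.
        return sum(self.down[t].size + self.up[t].size for t in self.down)


dim, rank = 256, 8                                # assumed sizes
tasks = ("detection", "mapping", "planning")      # assumed task set
adapter = TaskLevelMoEAdapter(dim, rank, tasks)

x = np.ones((4, dim))                             # four query tokens
y = adapter(x, "planning")                        # residual update for one task
dense_params = dim * dim                          # one full dim x dim layer, for comparison
```

Under these assumed sizes, the three experts together hold 3 × (256 × 8 + 8 × 256) = 12,288 parameters, compared with 65,536 for a single dense 256 × 256 layer. This is one way an adapter could add task-specific capacity at a small parameter cost.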

Paper Structure

This paper contains 17 sections, 17 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Comparison of the proposed Insight scene representation framework for end-to-end autonomous driving with the conventional pipeline.
  • Figure 2: Framework of our Insight scene representation for end-to-end autonomous driving. (a) Insight scene understanding with the human chain of thought. (b) Instructions inspired by the reasoning chain of human drivers. (c) Lightweight end-to-end autonomous driving model.
  • Figure 3: The joint tokens are processed by a decoder block. The refine head outputs class scores and refined anchors $P^{(0)}$, while the corresponding query features $F^{(0)}$ are forwarded to the next block. VLM tokens are preserved across blocks and aggregated as VLM memory for downstream modules.
  • Figure 4: Visualization results of InsightDrive in several scenarios. We observe that InsightDrive performs well on challenging scenarios.