Goal-Based Vision-Language Driving

Santosh Patapati, Trisanth Srinivasan

TL;DR

NovaDrive tackles real-time autonomous driving by unifying perception, mapping, and goal reasoning in a single transformer. It uses a dual-stage, goal-conditioned cross-attention that fuses vision and HD-map information before final reasoning in a partially fine-tuned 11B vision-language backbone, enabling interpretation and decision making in one forward pass. The approach yields higher success rate, improved path efficiency, and lower collision rates on the MD-NEX Outdoor benchmark, with ablations confirming the critical roles of waypoint prompts, map tokens, and the fusion mechanism. The results suggest a practical, explainable, and data-efficient path toward leaner driving stacks with potential applicability beyond driving to other embodied AI domains.
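The dual-stage, goal-conditioned cross-attention mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, a residual connection after each stage, and tokens given as plain lists of floats. The names `cross_attend` and `two_stage_fusion` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned Q/K/V projections, so the context tokens act as
    both keys and values. A real model would apply learned linear maps.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * tok[j] for w, tok in zip(weights, context)) for j in range(dim)]
        )
    return attended


def add_residual(tokens, update):
    """Element-wise residual connection (an assumption of this sketch)."""
    return [[a + b for a, b in zip(t, u)] for t, u in zip(tokens, update)]


def two_stage_fusion(waypoint_tokens, map_tokens, patch_tokens):
    """Stage 1: waypoint queries attend over HD-map tokens.
    Stage 2: the map-aware goal queries attend over image/depth patch tokens.
    """
    goal_ctx = add_residual(waypoint_tokens, cross_attend(waypoint_tokens, map_tokens))
    return add_residual(goal_ctx, cross_attend(goal_ctx, patch_tokens))


if __name__ == "__main__":
    waypoints = [[1.0, 0.0], [0.0, 1.0]]            # goal queries
    hd_map = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]  # HD-map tokens
    patches = [[0.2, 0.1], [0.9, 0.3]]              # image/depth patch tokens
    fused = two_stage_fusion(waypoints, hd_map, patches)
    print(len(fused), len(fused[0]))
```

The ordering is the point of the sketch: the goal tokens are conditioned on coarse map structure before they query fine-grained patches, so the second stage is steered by where the route is headed.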

Abstract

Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a single-branch vision-language architecture that processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory. We fine-tune the top 15 layers of an 11B LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), boosts path efficiency (SPL) to 0.66 (+0.11), and reduces collision frequency from 2.6% to 1.2% (-1.4%) relative to the previous state of the art. Our ablations confirm that waypoint tokens, partial VLM fine-tuning, and the cross-attention fusion contribute the most to these gains. Beyond safety, NovaDrive's shorter routes (resulting from the novel smoothness loss) translate to lower fuel or battery usage, pointing toward leaner, more easily updated driving stacks. NovaDrive can be extended to other embodied-AI domains as well.
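The abstract does not give the form of the smoothness loss, so the following is only a sketch of one common way to discourage abrupt steering and speed changes: penalizing the mean squared difference between consecutive control outputs. The function name `smoothness_loss` and the weights `w_steer` and `w_speed` are hypothetical, and the paper's actual formulation may differ.

```python
def mean_squared_step(sequence):
    """Mean of squared differences between consecutive values."""
    if len(sequence) < 2:
        return 0.0
    steps = [(nxt - cur) ** 2 for cur, nxt in zip(sequence, sequence[1:])]
    return sum(steps) / len(steps)


def smoothness_loss(steering, speed, w_steer=1.0, w_speed=1.0):
    """Illustrative smoothness penalty over a predicted control sequence.

    Assumption: a weighted sum of first-difference penalties on steering
    and speed. This is not taken from the paper.
    """
    return w_steer * mean_squared_step(steering) + w_speed * mean_squared_step(speed)


if __name__ == "__main__":
    smooth = smoothness_loss([0.00, 0.02, 0.04], [10.0, 10.1, 10.2])
    jerky = smoothness_loss([0.00, 0.30, -0.20], [10.0, 14.0, 9.0])
    print(smooth < jerky)
```

A penalty of this kind is zero for constant controls and grows quadratically with the size of each step, so large single-frame jumps are punished far more than gradual changes.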

Paper Structure

This paper contains 20 sections, 1 equation, 2 tables.