Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning
Aodi Wu, Xubo Luo
TL;DR
The paper tackles the reliability and spatial-grounding gaps of vision-language models in autonomous driving by introducing a four-part framework that decouples tasks via a Mixture-of-Prompts router, employs task-specific prompts with explicit coordinate grounding and structured reasoning (CoT/ToT), and uses adaptive visual assembly with per-task inference settings on a strong backbone. Key contributions include the Mixture-of-Prompts router to prevent prompt interference, explicit multi-view spatial grounding with visual markers, and reasoning-driven prompts that adapt to perception, prediction, planning, and corruption-detection tasks; these components yield substantial performance gains on Phase-1 ($$70.87\%$$) and Phase-2 ($$72.85\%$$) of the RoboSense challenge, demonstrating that structured prompting and spatial grounding can markedly enhance safety-critical VLM performance without fine-tuning. The approach also integrates dynamic temporal context, reference images for corruption detection, and ToT exploration to handle uncertainty, reinforcing reliability in real-world driving scenarios. The work has practical impact by providing a scalable, prompt-engineering-based method to improve VLM-driven autonomous driving understanding, offering a blueprint for robust, non-finetuned deployment in safety-critical settings.
Abstract
This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p, message roles) per task to optimize output quality. Implemented on Qwen2.5-VL-72B, our approach achieves 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data), demonstrating that structured prompting and spatial grounding substantially enhance VLM performance on safety-critical autonomous driving tasks. Code and prompt are available at https://github.com/wuaodi/UCAS-CSU-phase2.
