Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios
Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, Alexandre Alahi
TL;DR
This work probes the visual reasoning capabilities of vision-language models in real-world driving scenarios by introducing DrivingVQA, a dataset of 3,931 multiple-choice driving theory questions with expert-written explanations and bounding-box-grounded entities. It then presents RIV-CoT, a retrieval-based interleaved visual chain-of-thought method that grounds reasoning in retrieved image crops of the relevant entities. In experiments, RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting, and it scales to the larger A-OKVQA dataset using automatically generated pseudo-labels. The results suggest that explicit visual grounding through image crops helps on complex multimodal reasoning tasks, and that pseudo-labels offer a scalable way to apply grounded CoT beyond driving.
Abstract
While chain-of-thought (CoT) prompting improves reasoning in large language models, its effectiveness in vision-language models (VLMs) remains limited due to over-reliance on textual cues and memorized knowledge. To investigate the visual reasoning capabilities of VLMs in complex real-world scenarios, we introduce DrivingVQA, a visual question answering dataset derived from driving theory exams, which contains 3,931 multiple-choice problems with expert-written explanations and grounded entities relevant to the reasoning process. Leveraging this dataset, we propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables VLMs to reason using visual crops corresponding to these relevant entities. Our experiments show that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting. Furthermore, our method scales to the larger A-OKVQA reasoning dataset by leveraging automatically generated pseudo-labels, where it also outperforms CoT prompting.
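To make the idea of interleaving concrete, the following is a minimal sketch of how image crops of grounded entities could be placed between the textual steps of a reasoning trace. It is an illustration under stated assumptions, not the authors' implementation: the `Entity` and `Crop` types, the `clamp_box` and `interleave` helpers, the pixel-coordinate box format, and the rule of matching an entity to a step by its label appearing in the text are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Entity:
    """A grounded entity: a label plus its bounding box (hypothetical type)."""
    label: str
    box: Box


@dataclass
class Crop:
    """Placeholder for an image crop to be fed to the model (hypothetical type)."""
    label: str
    box: Box


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the image bounds so the crop region is always valid."""
    x0, y0, x1, y1 = box
    x0, x1 = sorted((max(0, min(x0, width)), max(0, min(x1, width))))
    y0, y1 = sorted((max(0, min(y0, height)), max(0, min(y1, height))))
    return (x0, y0, x1, y1)


def interleave(
    steps: List[str], entities: List[Entity], width: int, height: int
) -> List[Union[str, Crop]]:
    """Insert a crop after the first reasoning step that mentions its entity.

    Assumption: an entity is "relevant" to a step when its label occurs in the
    step text (case-insensitive). Each entity is inserted at most once.
    """
    sequence: List[Union[str, Crop]] = []
    used = set()
    for step in steps:
        sequence.append(step)
        for i, entity in enumerate(entities):
            if i not in used and entity.label.lower() in step.lower():
                sequence.append(Crop(entity.label, clamp_box(entity.box, width, height)))
                used.add(i)
    return sequence


if __name__ == "__main__":
    trace = interleave(
        steps=[
            "The stop sign ahead requires a full stop.",
            "The pedestrian is crossing, so the driver must yield.",
        ],
        entities=[
            Entity("stop sign", (10, 20, 50, 60)),
            Entity("pedestrian", (-5, 30, 40, 2000)),
        ],
        width=100,
        height=100,
    )
    for item in trace:
        print(item)
```

In a real pipeline, each `Crop` placeholder would be replaced by the pixels cut from the image at that box before the sequence is passed to the VLM; the sketch keeps only the coordinates so that it runs without an image library.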
