Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios

Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, Alexandre Alahi

TL;DR

This work probes visual reasoning capabilities in vision-language models within real-world driving scenarios by introducing DrivingVQA, a dataset with expert explanations and bounding-box grounded entities. It then presents RIV-CoT, a retrieval-based interleaved visual chain-of-thought prompting method that grounds reasoning in retrieved image crops of relevant entities. Experiments show that RIV-CoT improves both answer accuracy and reasoning correctness over vanilla CoT and scales to larger datasets like A-OKVQA using automatically generated pseudo-labels. The results underscore the value of explicit visual grounding and crop-based reasoning for complex multimodal tasks and point to scalable pathways for applying grounded CoT in broader domains.

Abstract

While chain-of-thought (CoT) prompting improves reasoning in large language models, its effectiveness in vision-language models (VLMs) remains limited due to over-reliance on textual cues and memorized knowledge. To investigate the visual reasoning capabilities of VLMs in complex real-world scenarios, we introduce DrivingVQA, a visual question answering dataset derived from driving theory exams, which contains 3,931 multiple-choice problems with expert-written explanations and grounded entities relevant to the reasoning process. Leveraging this dataset, we propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables VLMs to reason using visual crops corresponding to these relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting. Furthermore, we demonstrate that our method effectively scales to the larger A-OKVQA reasoning dataset by leveraging automatically generated pseudo-labels, outperforming CoT prompting.

Paper Structure (34 sections, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Illustration of retrieval-based interleaved visual chain-of-thought in DrivingVQA. Successfully answering the question requires detecting relevant entities (e.g., the truck, the car in the rear-view mirror), recognizing their attributes (e.g., the car signaling to overtake), and reasoning spatially to determine whether overtaking is safe. The interleaved explanation provides step-by-step reasoning aligned with visual content.
  • Figure 2: DrivingVQA example with a multiple-choice question, a set of relevant entities with their coordinates, and an expert-written explanation describing the situation step by step.
  • Figure 3: DrivingVQA dataset statistics.
  • Figure 4: Illustration of multi-step retrieval-based generation. At inference time, the input consists of the tokenized question and the image, which is encoded by the Vision Encoder and mapped to image tokens by the adapter. The Large Language Model generates output until it predicts a bounding box. At this point, generation pauses and the corresponding image crop is extracted from the predicted coordinates. The crop is encoded and adapted into an image crop token, which is then added back into the model’s context along with the question, the image tokens, and the previously generated outputs. This iterative process continues until the model produces its final answer.
  • Figure 5: Interleaved explanation augmentation. We feed GPT-4o with the question and possible answers, the image, the list of relevant entities and coordinates, and the original expert-written explanation. The resulting interleaved explanation refers to the relevant entities early in the sentences, allowing the reasoning process to be conditioned on the content of the image crops.
  • ...and 5 more figures
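
The generation loop described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the textual box format `[x1, y1, x2, y2]`, the names `riv_cot_generate`, `crop`, and `scripted_model`, and the toy list-of-rows image are all hypothetical. A real system would call a VLM and pass the crop through the vision encoder and adapter rather than appending raw pixels to a Python list.

```python
import re
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]      # assumed (x1, y1, x2, y2) pixel format
Image = List[List[int]]              # toy image: rows of pixel values
Context = List[Tuple[str, object]]   # ordered (kind, payload) entries

# Assumed textual box format; the paper's actual serialization may differ.
BOX_PATTERN = re.compile(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box (x2 and y2 are exclusive)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def riv_cot_generate(
    generate_until_box: Callable[[Context], str],
    image: Image,
    question: str,
    max_steps: int = 8,
) -> Context:
    """Alternate text generation and crop retrieval until no box is predicted."""
    context: Context = [("image", image), ("text", question)]
    for _ in range(max_steps):
        # The model generates until it emits a box or finishes its answer.
        segment = generate_until_box(context)
        context.append(("text", segment))
        match = BOX_PATTERN.search(segment)
        if match is None:
            break  # no box predicted: treat this segment as the final answer
        x1, y1, x2, y2 = (int(g) for g in match.groups())
        # Pause generation, retrieve the crop, and add it back to the context.
        context.append(("crop", crop(image, (x1, y1, x2, y2))))
    return context


# Hypothetical stand-in for a VLM: replays a fixed script, one segment per call,
# indexed by how many crops are already in the context.
def scripted_model(context: Context) -> str:
    script = [
        "The truck ahead [0, 0, 2, 2]",
        "The car in the mirror [2, 1, 4, 3]",
        "Answer: B",
    ]
    n_crops = sum(1 for kind, _ in context if kind == "crop")
    return script[n_crops]


TOY_IMAGE: Image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

result = riv_cot_generate(scripted_model, TOY_IMAGE, "Is it safe to overtake?")
for kind, payload in result[2:]:
    print(kind, payload)
```

The context is kept as an ordered list so that each crop lands immediately after the text segment that predicted it, which is the interleaving the caption describes. The `max_steps` cap is an added safeguard against a model that never stops predicting boxes; the source does not mention one.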