
Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru

Dunant Cusipuma, David Ortega, Victor Flores-Benites, Arturo Deza

TL;DR

Robusto-1 investigates cognitive alignment between humans and Vision-Language Models in real-world autonomous driving under out-of-distribution conditions by framing a Visual Question Answering task. The authors introduce a Peru-based dashcam dataset, sample 5-second scenes, and generate 15 questions per clip across three blocks (variable, multiple-choice, counterfactual) using an Oracle LLM. They apply Representational Similarity Analysis with sentence embeddings to compare human and VLM responses, revealing that VLMs are relatively aligned with each other while humans diverge, especially on counterfactual questions. The study highlights nuanced differences in representational structure between humans and machines and emphasizes that surface-level answer similarity does not imply shared internal representations, suggesting future work linking behavior with neural or cognitive data. The work provides a framework and dataset for evaluating AV systems on real-world, diverse driving contexts and underscores the need for deeper alignment between human cognition and AI decision-making in safety-critical settings.

Abstract

As multimodal foundational models start being deployed experimentally in Self-Driving cars, a reasonable question to ask is how similarly to humans these systems respond in certain driving situations, especially those that are out-of-distribution. To study this, we create the Robusto-1 dataset, which uses dashcam video data from Peru, a country with some of the most aggressive drivers in the world, a high traffic index, and a high ratio of bizarre to non-bizarre street objects likely never seen in training. In particular, to preliminarily test at a cognitive level how well Foundational Visual Language Models (VLMs) compare to Humans in Driving, we move away from bounding boxes, segmentation maps, occupancy maps or trajectory estimation to multi-modal Visual Question Answering (VQA), comparing both humans and machines through a popular method in systems neuroscience known as Representational Similarity Analysis (RSA). Depending on the type of questions we ask and the answers these systems give, we show in which cases VLMs and Humans converge or diverge, allowing us to probe their cognitive alignment. We find that the degree of alignment varies significantly depending on the type of questions asked to each type of system (Humans vs VLMs), highlighting a gap in their alignment.

Paper Structure

This paper contains 36 sections, 3 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: As multi-modal foundation models start being tested for Autonomous Driving applications, we inquire into their cognitive alignment under a Visual Question Answering scheme over multiple videos, comparing the answers of VLMs to those of Humans with tools from systems neuroscience. For this example in particular, a closer look reveals that the policeman is telling the driver to run through the red light. These sorts of edge-case scenarios allow us to better probe cognitive alignment.
  • Figure 2: Overview of the VQA procedure on the Robusto-1 Dataset. A set of 5-second clips is seen by ground-truth annotators (the authors), and Meta-Tags are extracted from 16 different categories. These are then passed, for each video, to a "Blind Oracle" LLM that formulates a set of 5 variable questions per clip. An additional set of 5 multiple-choice questions, which have Yes/No answers and/or involve rating or counting, and 5 open counterfactual questions are added, for a total pool of 15 questions per clip. We then ask a group of VLMs and Humans these questions to collect their answers.
  • Figure 3: How to calculate the System Similarity Matrix through Model Gramians, as done in Representational Similarity Analysis (RSA) (Kriegeskorte et al., 2008). A) We transform each answer into a vector through an embedding to later calculate each system's Gramian. Upper triangular parts of the Gramians across two systems are then correlated (violet). This can be applied to both humans and VLMs. B) The system similarity matrix $\mathcal{M}$ calculated over all humans and machines allows us to get an idea of how similar each system is to the others. A cartoon with no real values is shown in this diagram.
  • Figure 4: The first general result we find after applying Representational Similarity Analysis (RSA) to the responses of both humans and VLMs is that system convergence and divergence are modulated by the type of questions asked. Broadly speaking, we find that all VLMs respond very similarly to each other regardless of the type of questions asked, with a surprisingly high correlation for counterfactuals & hypotheticals. Humans, on the other hand, diverge heavily for counterfactuals & hypotheticals and converge strongly for multiple-choice.
  • Figure 5: For each question, we show the distance of each system's response to the median response. For the VLMs, the response placed here is the average response per question rather than a single response. We generally observe that the overlap between the answers of VLMs and Humans shifts depending on the nature of the questions asked, with a larger partial overlap for block 2, since multiple-choice answers are fixed in advance and therefore span a smaller space. Variance for block 3, on the other hand, is larger across humans and VLMs given the complexity of counterfactual & hypothetical questions.
  • ...and 11 more figures
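
The RSA procedure described in the Figure 3 caption (embed each answer, build a Gramian per system, correlate the upper triangular parts across systems) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the random vectors stand in for real sentence embeddings, and the cosine-similarity Gramian, the Pearson correlation, the exclusion of the diagonal, and the function names `gramian` and `system_similarity_matrix` are all assumptions made for the example.

```python
import numpy as np


def gramian(embeddings):
    """Cosine-similarity Gramian of one system's answers (n_answers x dim).

    Cosine similarity is an assumption; the caption only says "Gramian".
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T


def system_similarity_matrix(systems):
    """Correlate the upper-triangular Gramian entries of every pair of systems.

    `systems` is a list of (n_answers x dim) arrays, one per human or VLM,
    with rows aligned to the same questions. Pearson correlation and the
    exclusion of the diagonal (k=1) are assumptions.
    """
    n_answers = systems[0].shape[0]
    rows, cols = np.triu_indices(n_answers, k=1)
    flat = np.stack([gramian(e)[rows, cols] for e in systems])
    return np.corrcoef(flat)  # shape: (n_systems, n_systems)


# Hypothetical stand-in data: 4 systems, 15 answers each, 32-dim embeddings.
rng = np.random.default_rng(0)
n_questions, dim = 15, 32
systems = [rng.normal(size=(n_questions, dim)) for _ in range(4)]

M = system_similarity_matrix(systems)
print(M.shape)
```

With real data, each array would hold one system's embedded answers to the same ordered set of questions, so that entry (i, j) of every Gramian refers to the same pair of questions across systems.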