CARE-Drive: A Framework for Evaluating Reason-Responsiveness of Vision-Language Models in Automated Driving

Lucas Elbert Suryana; Farah Bierenga; Sanne van Buuren; Pepijn Kooij; Elsefien Tulleners; Federico Scari; Simeon Calvert; Bart van Arem; Arkady Zgonnikov

TL;DR

CARE-Drive (Context-Aware Reasons Evaluation for Driving) is a model-agnostic framework for evaluating reason-responsiveness in vision-language models applied to automated driving. It provides empirical evidence that reason-responsiveness in foundation models can be systematically evaluated without modifying model parameters.

Abstract

Foundation models, including vision-language models, are increasingly used in automated driving to interpret scenes, recommend actions, and generate natural-language explanations. However, existing evaluation methods primarily assess outcome-based performance, such as safety and trajectory accuracy, without determining whether model decisions reflect human-relevant considerations. As a result, it remains unclear whether explanations produced by such models correspond to genuine reason-responsive decision-making or merely post-hoc rationalizations. This limitation is especially significant in safety-critical domains because it can create false confidence. To address this gap, we propose CARE-Drive (Context-Aware Reasons Evaluation for Driving), a model-agnostic framework for evaluating reason-responsiveness in vision-language models applied to automated driving. CARE-Drive compares baseline and reason-augmented model decisions under controlled contextual variation to assess whether human reasons causally influence decision behavior. The framework employs a two-stage evaluation process. Prompt calibration ensures stable outputs. Systematic contextual perturbation then measures decision sensitivity to human reasons such as safety margins, social pressure, and efficiency constraints. We demonstrate CARE-Drive in a cyclist-overtaking scenario involving competing normative considerations. Results show that explicit human reasons significantly influence model decisions, improving alignment with expert-recommended behavior. However, responsiveness varies across contextual factors, indicating uneven sensitivity to different types of reasons. These findings provide empirical evidence that reason-responsiveness in foundation models can be systematically evaluated without modifying model parameters.
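The baseline versus reason-augmented comparison described in the abstract can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function name `alignment_rate`, the decision labels, and the toy decision lists are all hypothetical and do not come from the paper.

```python
from typing import Sequence


def alignment_rate(decisions: Sequence[str], expert: str) -> float:
    """Fraction of model decisions that equal the expert reference decision."""
    if not decisions:
        raise ValueError("no decisions to evaluate")
    return sum(d == expert for d in decisions) / len(decisions)


# Toy, made-up model outputs for a single scene (NOT results from the paper).
baseline = ["stay", "stay", "overtake", "stay"]           # prompts without reasons
augmented = ["overtake", "overtake", "stay", "overtake"]  # prompts with injected reasons
expert_reference = "overtake"                             # hypothetical expert decision

# A positive gain means the injected reasons moved decisions toward the expert.
gain = alignment_rate(augmented, expert_reference) - alignment_rate(baseline, expert_reference)
print(f"alignment gain from injected reasons: {gain:+.2f}")
```

In practice each list would hold repeated stochastic runs of the same prompt, so that the difference reflects a shift in decision behavior rather than a single sample.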

Paper Structure (29 sections, 24 equations, 4 figures, 6 tables)

Introduction
Related Work
Vision--Language Models and Evaluation of Automated Driving Decisions
Meaningful Human Control
The CARE-Drive Evaluation Framework
Problem Definition
Use Case: Overtaking Scenario
Stage 1: Prompt Calibration
Calibration parameters
Finding optimal parameters
Stage 2: Contextual Evaluation
Context sensitivity analysis
Binary logit method
Odds ratio and probability interpretation
CARLA validation
...and 14 more sections

Figures (4)

Figure 1: Overview of the CARE-Drive framework for evaluating reason-responsive VLM decision-making in ethically ambiguous driving scenarios. The left panel illustrates the motivating setting: similar driving scenes (e.g., cyclist on a no-passing road) can give rise to conflicting reasons (safety, legality, comfort, efficiency), leading to ambiguity about whether to overtake. The middle panel shows how a vision-language model (VLM) receives a scene representation $S=\{V,O\}$, consisting of a visual input $V$ and observable context $O$, together with a decision instruction $I(R,T,L)$ that specifies injected human reasons $R$, a thought strategy $T$, and an explanation-length regime $L$. The VLM produces a decision $D_{VLM}$ either with human reasons or without them ($R=\varnothing$), enabling comparison between baseline and reason-augmented behaviour. The right panel depicts the two-stage CARE-Drive evaluation pipeline. In Stage 1 (Prompt Calibration), the model $M$ and thought strategy $T$ are treated as calibration variables. The goal is to identify an optimal configuration $(M^*, T^*)$ that maximises alignment between the reason-augmented VLM decision $D_{VLM}^{(+R)}$ and the expert reference decision $D_{AV}$. In Stage 2 (Contextual Evaluation), the calibrated configuration $(M^*, T^*)$ is held fixed, and the observable context $O$ (e.g., time-to-collision with oncoming vehicles, presence of a vehicle behind, passenger urgency) is systematically varied to measure how strongly the injected human reasons causally influence the VLM's decisions across situations. The blue arrow at the bottom illustrates CARE-Drive's core outcome: quantifying the alignment between baseline VLM decisions, reason-augmented decisions, and expert judgments.
Figure 2: Visual scenes used in CARE-Drive. The scenes are taken from the CARLA simulator. (a) Scenario 1 (baseline) and Scenario 3 (vehicle behind) share the same dashboard image, since rear vehicles are not visible in $V$. (b) Scenario 2 includes an oncoming vehicle visible in the image. The vehicle behind in Scenario 3 is represented only in the observable context $O$.
Figure 3: Illustration of the overtaking scenario with an oncoming vehicle. The passing phase, defined as the moment when the ego vehicle is in the opposing lane while overtaking the cyclist, is used as the reference point for calculating the time-to-collision (TTC) with the oncoming vehicle. The longitudinal distance to the cyclist is defined when the ego vehicle is directly behind the cyclist prior to initiating the overtaking maneuver.
Figure 4: Full-factorial empirical overtaking probability $P(Y=1)$ of the calibrated CARE-Drive configuration, where $(M^*,T^*)=(\texttt{gpt-4.1},\text{Tree-of-Thought})$, under systematic variation of observable context and explanation length. Each subplot shows the empirical overtaking probability $P(Y=1)$ as a function of time-to-collision $TTC_o$. Columns correspond to following time $F\in\{12,18,24\}\,$s, and rows correspond to the explanation-length regime (top row: $L=\text{No-Limit}$; bottom row: $L=\text{Few-Sentences}$). Within each subplot, curves represent combinations of vehicle-behind indicator $B\in\{0,1\}$ and passenger urgency $U\in\{0,1\}$. Each point represents the proportion of overtaking decisions over 30 stochastic runs per condition.
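The full-factorial design summarised in the Figure 4 caption can be sketched in Python as follows. This is a hedged illustration only: the levels of $F$, $B$, $U$, $L$ and the 30 runs per condition are taken from the caption, whereas the $TTC_o$ levels are not listed there and are passed in as hypothetical placeholders. `fake_decide` is an invented stand-in for a VLM call and says nothing about how the real model behaves.

```python
import itertools
import random
from typing import Callable, Dict, Sequence, Tuple

FOLLOW_TIMES = (12, 18, 24)              # F in seconds (from the caption)
BEHIND = (0, 1)                          # B: vehicle-behind indicator
URGENCY = (0, 1)                         # U: passenger urgency
LENGTHS = ("No-Limit", "Few-Sentences")  # L: explanation-length regime
RUNS = 30                                # stochastic runs per condition

Condition = Tuple[float, int, int, int, str]


def overtaking_proportions(
    ttc_levels: Sequence[float],
    decide: Callable[[float, int, int, int, str], int],
    runs: int = RUNS,
) -> Dict[Condition, float]:
    """Empirical P(Y=1) for every cell of the full-factorial grid."""
    grid = itertools.product(ttc_levels, FOLLOW_TIMES, BEHIND, URGENCY, LENGTHS)
    return {cond: sum(decide(*cond) for _ in range(runs)) / runs for cond in grid}


def fake_decide(ttc: float, follow: int, behind: int, urgency: int, length: str) -> int:
    """Hypothetical stand-in for a VLM query: returns 1 (overtake) or 0 (stay)."""
    return int(random.random() < min(1.0, ttc / 10.0))


random.seed(0)
# The TTC levels below are placeholders, NOT the values used in the paper.
props = overtaking_proportions(ttc_levels=(2.0, 6.0), decide=fake_decide)
print(len(props), "conditions evaluated")
```

Replacing `fake_decide` with a function that queries the calibrated model and parses its decision would yield the per-condition proportions that each point in the figure represents.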
