Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving

Amir Mallak; Alaa Maalouf

TL;DR

Out-of-distribution (OOD) robustness in vision-based autonomous driving is often reported as a single metric, which masks the environmental factors that drive policy failures. The paper proposes a factorized evaluation framework across five axes (scene, season, weather, time, and agents) and tests policies under controlled $k$-factor shifts in the VISTA simulator, comparing fully connected (FC), CNN, ViT, and foundation-model (FM) feature-based policies while varying in-distribution (ID) data scale, diversity, and temporal context. Key contributions include a formal factorized OOD framework, systematic architectural comparisons under matched budgets, an analysis of training data design and diversity, and an evaluation of frozen FM features with temporal context. ViT policies on FM features achieve state-of-the-art robustness (above $85\%$) under up to three simultaneous shifts, and some factor interactions are non-additive. The results yield practical design rules for robust, real-world driving policies: exposure to hard conditions (e.g., winter/snow, urban environments) is valuable, FM features trade latency for robustness, and diverse ID coverage strengthens weak OOD cases. Overall, the study provides a structured methodology for diagnosing and improving OOD robustness in driving systems, with implications for data collection, simulation curricula, and policy selection.

Abstract

Out-of-distribution (OOD) robustness in autonomous driving is often reduced to a single number, hiding what breaks a policy. We decompose environments along five axes (scene (rural/urban), season, weather, time (day/night), and agent mix) and measure performance under controlled $k$-factor perturbations ($k \in \{0,1,2,3\}$). Using closed-loop control in VISTA, we benchmark FC, CNN, and ViT policies, train compact ViT heads on frozen foundation-model (FM) features, and vary in-distribution (ID) support in scale, diversity, and temporal context. (1) ViT policies are markedly more OOD-robust than comparably sized CNN/FC policies, and FM features yield state-of-the-art success at a latency cost. (2) Naive temporal inputs (multi-frame) do not beat the best single-frame baseline. (3) The largest single-factor drops are rural $\rightarrow$ urban and day $\rightarrow$ night ($\sim 31\%$ each); actor swaps cost $\sim 10\%$ and moderate rain $\sim 7\%$; season shifts can be drastic, and combining a time flip with other changes further degrades performance. (4) FM-feature policies stay above $85\%$ under three simultaneous changes; non-FM single-frame policies take a large first-shift hit, and all non-FM models fall below $50\%$ by three changes. (5) Interactions are non-additive: some pairings partially offset, whereas season-time combinations are especially harmful. (6) Training on winter/snow is most robust to single-factor shifts, while a rural+summer baseline gives the best overall OOD performance. (7) Scaling traces/views improves robustness ($+11.8$ points from $5$ to $14$ traces), yet targeted exposure to hard conditions can substitute for scale. (8) Using multiple ID environments broadens coverage and strengthens weak cases (urban OOD $60.6\% \rightarrow 70.1\%$) with a small ID drop; single-ID training preserves peak performance but only in a narrow domain. These results yield actionable design rules for OOD-robust driving policies.
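To make the notion of a controlled $k$-factor perturbation concrete, the following is a minimal sketch of how the shifted environments could be enumerated from one ID environment. The factor values come from the key of Figure 2; the `FACTORS` dictionary, the `k_factor_shifts` helper, and the `base` environment are hypothetical names chosen for illustration, not the authors' code. The sketch enumerates the full combinatorial grid, so it may include combinations the paper does not evaluate or that a simulator cannot render (for example, summer with snow).

```python
from itertools import combinations, product

# Factor values as listed in the key of Figure 2.
# The dictionary layout itself is an illustrative assumption.
FACTORS = {
    "scene": ("rural", "urban"),
    "season": ("winter", "spring", "summer", "fall"),
    "weather": ("dry", "rain", "snow"),
    "time": ("day", "night"),
    "agents": ("car", "animal"),
}


def k_factor_shifts(base, k):
    """Return every environment that differs from `base` in exactly k factors."""
    shifts = []
    for axes in combinations(FACTORS, k):
        # For each chosen axis, collect the values other than the ID value.
        alternatives = [[v for v in FACTORS[a] if v != base[a]] for a in axes]
        for values in product(*alternatives):
            env = dict(base)          # copy the ID environment
            env.update(zip(axes, values))  # overwrite only the shifted axes
            shifts.append(env)
    return shifts


# Hypothetical ID environment: rural + summer is a baseline named in the
# abstract; the remaining three values are assumed for illustration.
base = {
    "scene": "rural",
    "season": "summer",
    "weather": "dry",
    "time": "day",
    "agents": "car",
}

for k in range(4):
    print(k, len(k_factor_shifts(base, k)))
```

With `k = 0` the helper returns only the ID environment itself, which corresponds to the in-distribution evaluation; larger `k` values give the test sets over which a per-level success rate could be averaged.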

Paper Structure (30 sections, 3 equations, 9 figures, 1 table)

Introduction
Our contributions
Related Work
Experimental Setup
Key questions we address
Task Formulation and Platform
Environment Factorization and Distribution Shifts
Levels of OOD shifts
Basic Policy Models and Training
Foundation-model Feature Policies
Evaluation Metrics and Protocol
Study S1: Architecture Robustness and OOD Factorized Shifts
Study S2: Effect of the ID Training Distribution
Study S3: Foundation-Model Features with the Best Backbone
Study S4: Data Scale and Diversity vs. Specialization
...and 15 more sections

Figures (9)

Figure 1: Accuracy as a function of runtime across model variants.
Figure 2: Effect of changes across one, two, and three simultaneous factors. Key: Sc, scene (R, rural; U, urban), Se, season (Wi, winter; Sp, spring; Su, summer; Fa, fall), We, weather (Dr, dry; Ra, rain; Sn, snow), Ti, time (D, day; N, night), Ag, agents (C, car; An, animal).
Figure 3: Themed star plots for multi factor shifts. Each subplot aggregates all shifts matching the factor theme.
Figure 4: Accuracy as a function of environment factors changes across model variants.
Figure 5: Accuracy trends under training distribution, environmental changes, and number of training traces.
...and 4 more figures