Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving
Amir Mallak, Alaa Maalouf
TL;DR
The paper addresses the problem that out-of-distribution (OOD) robustness in vision-based autonomous driving is usually reported as a single metric, which masks which environmental factors cause policy failures. It proposes a factorized evaluation framework across five axes (scene, season, weather, time, and agents) and tests policies under controlled $k$-factor shifts in the VISTA simulator, comparing FC, CNN, ViT, and foundation-model (FM) feature-based policies while varying in-distribution (ID) data scale, diversity, and temporal context. Key contributions include a formal factorized OOD framework, systematic architectural comparisons under matched budgets, an analysis of training-data design and diversity, and an evaluation of frozen FM features with temporal context. The findings show that ViT policies on FM features achieve state-of-the-art robustness (above $85\%$) under up to three simultaneous shifts, and that some factor interactions are non-additive. The results yield practical design rules for robust, real-world driving policies, highlighting the value of exposure to hard conditions (e.g., winter/snow, urban environments), the latency cost of FM features, and the importance of diverse ID coverage for strengthening weak OOD cases. Overall, the study provides a structured methodology for diagnosing and improving OOD robustness in driving systems, with implications for data collection, simulation curricula, and policy selection.
Abstract
Out-of-distribution (OOD) robustness in autonomous driving is often reduced to a single number, hiding what breaks a policy. We decompose environments along five axes, namely scene (rural/urban), season, weather, time (day/night), and agent mix, and measure performance under controlled $k$-factor perturbations ($k \in \{0,1,2,3\}$). Using closed-loop control in VISTA, we benchmark FC, CNN, and ViT policies, train compact ViT heads on frozen foundation-model (FM) features, and vary in-distribution (ID) support in scale, diversity, and temporal context. (1) ViT policies are markedly more OOD-robust than comparably sized CNN/FC policies, and FM features yield state-of-the-art success at a latency cost. (2) Naive temporal inputs (multi-frame) do not beat the best single-frame baseline. (3) The largest single-factor drops are rural $\rightarrow$ urban and day $\rightarrow$ night ($\sim 31\%$ each); actor swaps cost $\sim 10\%$ and moderate rain $\sim 7\%$; season shifts can be drastic, and combining a time flip with other changes degrades performance further. (4) FM-feature policies stay above $85\%$ under three simultaneous changes; non-FM single-frame policies take a large first-shift hit, and all non-FM models fall below $50\%$ by three changes. (5) Interactions are non-additive: some pairings partially offset, whereas season-time combinations are especially harmful. (6) Training on winter/snow is most robust to single-factor shifts, while a rural+summer baseline gives the best overall OOD performance. (7) Scaling traces/views improves robustness ($+11.8$ points from $5$ to $14$ traces), yet targeted exposure to hard conditions can substitute for scale. (8) Using multiple ID environments broadens coverage and strengthens weak cases (urban OOD $60.6\% \rightarrow 70.1\%$) with a small ID drop; single-ID training preserves peak performance but only in a narrow domain. These results yield actionable design rules for OOD-robust driving policies.
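To make the $k$-factor protocol concrete, the following minimal sketch enumerates every evaluation environment that differs from an ID environment in exactly $k$ of the five axes. Only the axis names and the rural+summer baseline come from the abstract; the two levels per axis, the `FACTORS` table, and the helper `k_factor_shifts` are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations, product

# Hypothetical factor levels: only the five axis names are taken from the
# abstract; the concrete levels per axis are assumptions for illustration.
FACTORS = {
    "scene": ["rural", "urban"],
    "season": ["summer", "winter"],
    "weather": ["clear", "rain"],
    "time": ["day", "night"],
    "agents": ["cars", "mixed"],
}


def k_factor_shifts(id_env, k, factors=FACTORS):
    """Return all environments that differ from id_env in exactly k axes."""
    shifted = []
    # Choose which k axes to perturb.
    for axes in combinations(sorted(factors), k):
        # For each chosen axis, list every level other than the ID level.
        alternatives = [
            [level for level in factors[axis] if level != id_env[axis]]
            for axis in axes
        ]
        # Take every joint assignment of alternative levels on those axes.
        for levels in product(*alternatives):
            env = dict(id_env)
            env.update(zip(axes, levels))
            shifted.append(env)
    return shifted


# Assumed ID environment, loosely matching the rural+summer baseline.
ID_ENV = {
    "scene": "rural",
    "season": "summer",
    "weather": "clear",
    "time": "day",
    "agents": "cars",
}

if __name__ == "__main__":
    for k in range(4):
        print(k, len(k_factor_shifts(ID_ENV, k)))
```

Reporting success per environment in each of these sets, rather than one pooled average, is what turns robustness into a function of the shifted factors. With more than two levels per axis the sets grow accordingly, since every alternative level is enumerated.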
