Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving

Mert Keser, Halil Ibrahim Orhan, Niki Amini-Naieni, Gesina Schwalbe, Alois Knoll, Matthias Rottmann

TL;DR

VFM embeddings combined with density estimation outperform state-of-the-art binary OOD classification methods at identifying OOD inputs. The method also detects high-risk inputs that are likely to cause errors in downstream tasks, thereby improving overall performance.

Abstract

Deep neural networks (DNNs) remain challenged by distribution shifts in complex open-world domains like automated driving (AD): Robustness against as-yet-unknown novel objects (semantic shift) or styles like lighting conditions (covariate shift) cannot be guaranteed. Hence, reliable operation-time monitors for identifying out-of-training-data-distribution (OOD) scenarios are imperative. Current approaches for OOD classification are untested for complex domains like AD, are limited in the kinds of shifts they detect, or even require supervision with OOD samples. To prepare for unanticipated shifts, we instead establish a framework around a principled, unsupervised and model-agnostic method that unifies detection of semantic and covariate shifts: Fit a full model of the training data's feature distribution, then use its density at new points as an in-distribution (ID) score. To implement this, we propose to combine Vision Foundation Models (VFMs) as feature extractors with density modeling techniques. Through a comprehensive benchmark of 4 VFMs with different backbone architectures and 5 density-modeling techniques against established baselines, we provide the first systematic evaluation of OOD classification capabilities of VFMs across diverse conditions. A comparison with state-of-the-art binary OOD classification methods reveals that VFM embeddings with density estimation outperform existing approaches in identifying OOD inputs. Additionally, we show that our method detects high-risk inputs likely to cause errors in downstream tasks, thereby improving overall performance. Overall, VFMs, when coupled with robust density modeling techniques, are a promising way to realize model-agnostic, unsupervised, reliable safety monitors in complex vision tasks.
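The core idea in the abstract (model the training feature distribution, then use its density at a new point as the ID score) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the feature vectors are synthetic stand-ins for VFM embeddings, the density model is a single diagonal Gaussian rather than any of the five techniques benchmarked in the paper, and the 5th-percentile threshold and the names `fit_diag_gaussian` and `id_score` are hypothetical.

```python
import math
import random


def fit_diag_gaussian(features):
    """Estimate per-dimension mean and variance from ID feature vectors."""
    n = len(features)
    dim = len(features[0])
    means = [sum(f[j] for f in features) / n for j in range(dim)]
    variances = [
        # Small additive floor keeps the variance strictly positive.
        sum((f[j] - means[j]) ** 2 for f in features) / n + 1e-6
        for j in range(dim)
    ]
    return means, variances


def id_score(x, means, variances):
    """Log-density of x under the fitted Gaussian; higher means more in-distribution."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )


# Synthetic stand-ins for VFM embeddings (hypothetical data, not from the paper).
rng = random.Random(0)
train_features = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]
means, variances = fit_diag_gaussian(train_features)

# Assumed decision rule: flag inputs scoring below the 5th percentile of training scores.
train_scores = sorted(id_score(f, means, variances) for f in train_features)
threshold = train_scores[int(0.05 * len(train_scores))]

id_sample = [0.0] * 8   # close to the training mean
ood_sample = [6.0] * 8  # far from the training data

id_flagged = id_score(id_sample, means, variances) < threshold
ood_flagged = id_score(ood_sample, means, variances) < threshold
```

In this toy setting the point near the training mean scores above the threshold and the distant point falls below it. Because the score depends only on the input's features and not on any downstream model, the same monitor could in principle sit in front of any task network, which is the sense of "model-agnostic" used above.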

Paper Structure (40 sections, 11 equations, 5 figures, 14 tables)

This paper contains 40 sections, 11 equations, 5 figures, 14 tables.

Figures (5)

  • Figure 1: A monitoring system for autonomous driving that uses a pre-trained Vision Foundation Model’s image encoder to detect distribution shifts in input data. The system identifies whether camera inputs (red dots) fall within the model’s training distribution (brown dots) or represent covariate shifts (top) or semantic shifts (bottom), helping assess operational safety and potential failure risks.
  • Figure 2: Visualization of Mask2Anomaly [rai2024mask2anomaly] applied to selected images exhibiting semantic and covariate shifts. The left column presents the original images from various datasets, including Lost and Found [pinggera2016lost], ACDC Night and Rain [sakaridis2021acdc], and SegmentMeIfYouCan Anomaly Track [chan2021segmentmeifyoucan]. The right column displays the corresponding OOD object-level maps generated by Mask2Anomaly.
  • Figure 3: Example images from four subsets of the ACDC dataset [sakaridis2021acdc].
  • Figure 6: Comparison of AIC values across various model backbones, highlighting the trade-off between model complexity and goodness of fit for GMMs applied to different VFMs. Each plot displays the AIC values for a specific backbone architecture, with the optimal number of GMM components (minimizing AIC) indicated. The size of the latent space vector for each architecture is also annotated.
  • Figure 7: AIC values for ImageNet-trained backbones and autoencoders trained on reference data. Each subplot depicts the AIC profile for a specific architecture or autoencoder configuration, demonstrating the relationship between GMM complexity and fit. The optimal number of components, minimizing AIC, is indicated on each curve.
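The AIC-based choice of GMM component count described in the captions for Figures 6 and 7 can be sketched as follows. This is an illustrative example only: it uses the standard definition AIC = 2p - 2 ln L, assumes full covariance matrices when counting parameters (the captions do not state the covariance type), and the log-likelihood values, the feature dimension of 4, and the function names are hypothetical.

```python
def gmm_num_params(n_components, dim):
    """Free parameters of a GMM with full covariance matrices (assumed type)."""
    weights = n_components - 1                      # mixing weights sum to 1
    means = n_components * dim                      # one mean vector per component
    covs = n_components * dim * (dim + 1) // 2      # symmetric covariance per component
    return weights + means + covs


def aic(total_log_likelihood, num_params):
    """Akaike information criterion: penalizes complexity, rewards fit."""
    return 2 * num_params - 2 * total_log_likelihood


def select_n_components(loglik_by_k, dim):
    """Return the component count with the lowest AIC, plus all AIC values."""
    scores = {k: aic(ll, gmm_num_params(k, dim)) for k, ll in loglik_by_k.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical total log-likelihoods of fitted GMMs on 4-dimensional features.
loglik_by_k = {1: -5200.0, 2: -4900.0, 3: -4880.0, 4: -4875.0}
best_k, aic_by_k = select_n_components(loglik_by_k, dim=4)
```

With these made-up numbers the fit improves only marginally from three to four components while the parameter count keeps growing, so the minimum AIC lands at three components. With full covariances the parameter count grows quadratically in the feature dimension, which is one reason the latent vector size annotated in Figure 6 matters for where that minimum falls.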