What could go wrong? Discovering and describing failure modes in computer vision

Gabriela Csurka; Tyler L. Hayes; Diane Larlus; Riccardo Volpi

TL;DR

This work formalizes Language-Based Error Explainability (LBEE), aiming to predict and describe failure modes of computer vision systems in natural language. It proposes an unsupervised, joint vision-language framework (Open-CLIP) that clusters hard/error-prone image subsets and associates clusters with descriptive sentences, using a ground-truth-like metric ${S}^{*}_{\beta}$ to evaluate explanations. It introduces a family of methods (TopS, SetDiff, PDiff, FPdiff) and a cohesive metric suite (AHR, ACR, TPR, JI) to benchmark language-based failure descriptions across segmentation, spurious-context, and ImageNet tasks, with extensive experiments on WD2, IDD, ACDC, NICO${++}$, and ImageNet-1K. The results demonstrate that the approach can recover meaningful, interpretable failure explanations and provide a scalable, interpretable lens on model reliability, with potential to guide data collection and safety-focused debugging.

Abstract

Deep learning models are effective, yet brittle. Even carefully trained, their behavior tends to be hard to predict when confronted with out-of-distribution samples. In this work, our goal is to propose a simple yet effective solution to predict and describe via natural language potential failure modes of computer vision models. Given a pretrained model and a set of samples, our aim is to find sentences that accurately describe the visual conditions in which the model underperforms. In order to study this important topic and foster future research on it, we formalize the problem of Language-Based Error Explainability (LBEE) and propose a set of metrics to evaluate and compare different methods for this task. We propose solutions that operate in a joint vision-and-language embedding space, and can characterize through language descriptions model failures caused, e.g., by objects unseen during training or adverse visual conditions. We experiment with different tasks, such as classification under the presence of dataset bias and semantic segmentation in unseen environments, and show that the proposed methodology isolates nontrivial sentences associated with specific error causes. We hope our work will help practitioners better understand the behavior of models, increasing their overall safety and interpretability.

Paper Structure (21 sections, 12 equations, 15 figures, 5 tables)

Introduction
Related work
Language-Based Error Explainability
A family of approaches to solve LBEE
Evaluation metrics for LBEE
Experiments
Results
ACDC
IDD
WD2
NICO$_{++}^{85}$
ImageNet 1K
Concluding remarks
Datasets and Tasks
Comparison with prior art
...and 6 more sections

Figures (15)

Figure 2: Overview. Given a pretrained model ${\mathcal{M}}_\theta$, a target image set ${\mathcal{X}}$, and a set of sentences ${\mathcal{S}}$, the family of solutions we propose for LBEE proceeds in five steps. In Step 1, images from the target set ${\mathcal{X}}$ are split into easy and hard samples, based on the model's confidence. In Step 2, samples are embedded using Open-CLIP and the visual embeddings of each set are clustered independently. In Step 3, for each hard and easy cluster we compute the cosine similarities between the textual embeddings of candidate sentences and the cluster prototypes; we also assign the closest easy prototype to each hard prototype. Step 4 performs sentence selection for each hard cluster, based on these sentence-prototype similarities. Step 5 aggregates the cluster-specific sentence sets to produce the output.
Figure 3: Numerical results (all datasets). From top to bottom: ACR, AHR, TPR and JI scores on NICO${++}$ unsupervised and supervised per-class (first and second column, respectively), ImageNet per-class (third column) and Urban Scene Segmentation (last column).
Figure 4: Two hard clusters in ACDC explained by different methods. We illustrate a few example images from each cluster and, below them, the three sentences retained for it by each method.
Figure 5: Unsupervised and supervised splitting (ACDC). Results for the different metrics (from left to right: ACR, AHR, TPR and JI) on the ACDC dataset when the split is done with entropy, pixel accuracy and mIoU.
Figure 6: Two hard clusters from IDD explained by different methods. We illustrate a few example images from each cluster and, below them, the three sentences retained for it by each method.
...and 10 more figures
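
The five steps listed in the Figure 2 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch under several assumptions that are not taken from the paper: random vectors stand in for Open-CLIP image and text embeddings, a fixed confidence threshold performs the easy/hard split, a plain k-means on unit-norm vectors does the clustering, and the sentence selection rule (ranking sentences by how much closer they are to a hard prototype than to its matched easy prototype) is a hypothetical placeholder rather than any of TopS, SetDiff, PDiff, or FPdiff. The names `describe_hard_clusters`, `kmeans_prototypes`, `threshold`, `k`, and `top_n` are all invented for this example.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def kmeans_prototypes(x, k, n_iter=20, seed=0):
    """Plain k-means on unit vectors (an assumed choice); returns k prototypes."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(x @ centers.T, axis=1)  # most similar center
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
        centers = l2_normalize(centers)
    return centers


def describe_hard_clusters(img_emb, confidence, txt_emb, sentences,
                           threshold=0.5, k=2, top_n=3):
    img_emb, txt_emb = l2_normalize(img_emb), l2_normalize(txt_emb)

    # Step 1: split into hard and easy samples by model confidence
    # (the fixed threshold is an assumption of this sketch).
    hard = img_emb[confidence < threshold]
    easy = img_emb[confidence >= threshold]

    # Step 2: cluster the two sets independently.
    hard_protos = kmeans_prototypes(hard, k)
    easy_protos = kmeans_prototypes(easy, k)

    # Step 3: sentence-prototype cosine similarities, and the closest
    # easy prototype for each hard prototype.
    sim_hard = hard_protos @ txt_emb.T  # shape (k, n_sentences)
    sim_easy = easy_protos @ txt_emb.T
    closest_easy = np.argmax(hard_protos @ easy_protos.T, axis=1)

    # Step 4: hypothetical selection rule, keeping the top_n sentences whose
    # similarity to the hard prototype most exceeds that to the easy one.
    per_cluster = []
    for h in range(k):
        gap = sim_hard[h] - sim_easy[closest_easy[h]]
        best = np.argsort(-gap)[:top_n]
        per_cluster.append([sentences[i] for i in best])

    # Step 5: aggregate the cluster-specific sets, dropping duplicates.
    merged = list(dict.fromkeys(s for c in per_cluster for s in c))
    return per_cluster, merged


# Toy usage with synthetic data standing in for real embeddings.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(40, 16))
confidence = np.linspace(0.0, 1.0, 40)  # half the samples fall below 0.5
sentences = [f"candidate sentence {i}" for i in range(10)]
txt_emb = rng.normal(size=(10, 16))
per_cluster, merged = describe_hard_clusters(img_emb, confidence, txt_emb, sentences)
```

With real data, `img_emb` and `txt_emb` would come from the image and text encoders of a shared vision-language model, and `confidence` from the model under study; the rest of the sketch is unchanged by that substitution.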
