Reassessing the Validity of Spurious Correlations Benchmarks

Samuel J. Bell, Diane Bouchacourt, Levent Sagun

TL;DR

This work examines benchmark validity by defining three desiderata that a benchmark should satisfy in order to meaningfully evaluate methods, and presents a simple recipe for practitioners to choose methods using the most similar benchmark to their given problem.

Abstract

Neural networks can fail when the data contains spurious correlations. To understand this phenomenon, researchers have proposed numerous spurious correlations benchmarks upon which to evaluate mitigation methods. However, we observe that these benchmarks exhibit substantial disagreement, with the best methods on one benchmark performing poorly on another. We explore this disagreement, and examine benchmark validity by defining three desiderata that a benchmark should satisfy in order to meaningfully evaluate methods. Our results have implications for both benchmarks and mitigations: we find that certain benchmarks are not meaningful measures of method performance, and that several methods are not sufficiently robust for widespread use. We present a simple recipe for practitioners to choose methods using the most similar benchmark to their given problem.

Paper Structure (24 sections, 4 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Spurious correlations benchmarks disagree. (a) Correlation between worst-group accuracies on different benchmarks reported by Yang et al. (2023). (b) Waterbirds and NICO++ produce disagreeing ranks, such that the best method on Waterbirds (DFR) is the second worst on NICO++.
  • Figure 2: (a) Standard deviation (SD) of test accuracies over groups for an ERM-trained model. (b) SD of worst-group test accuracies over methods. (c) SD of ERM accuracies over groups vs. SD of worst-group test accuracies over methods. Certain benchmarks, e.g. ImageNetBG, do not produce a "worst group", and result in tightly-clustered method performance.
  • Figure 3: Task difficulty due to spurious correlation, as measured by Bayes Factor $K$, on modified benchmarks. Increasing the label-attribute correlation (a, b) and foreground noise (e) increases $K$, while increasing background noise (c) or applying a solid gray background (c, orange point) decreases $K$, except in the case where there is no correlation (d). Attribute noise degrades the efficacy of $K$ (f).
  • Figure 4: Benchmark agreement (Pearson's $r$) as a function of difference in task difficulty due to spurious correlation, as measured by Bayes Factor $K$. Each panel shows the agreement in worst-group test accuracies on the named dataset vs. all other datasets. Only benchmarks valid according to ERM group variability and method variability are included. Valid benchmarks should agree more strongly with those that exhibit a similar $K$, thus exhibiting a negative correlation. Black solid line fit with OLS linear regression.
  • Figure 5: (a) Benchmark agreement (Pearson's $r$) as a function of difference in task difficulty due to spurious correlation ($K$), i.e., the (negative) slope of each line in Figure 4. (b) $R^2$ of each line. (c) Task difficulty due to spurious correlation, $K$. Valid benchmarks should most agree with other benchmarks with similar $K$, so a large coefficient indicates a more valid benchmark.
  • ...and 7 more figures
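
The quantities named in the captions of Figures 2, 4 and 5 can be illustrated with a small sketch. It is not the authors' code. The benchmark names, accuracies and $K$ values below are made-up placeholders, and the use of the absolute difference in $K$ as the "difference in task difficulty" is an assumption. The sketch computes the SD of worst-group accuracies over methods, the Pearson's $r$ agreement between a target benchmark and every other benchmark, and the OLS slope of agreement against the difference in $K$.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)


# Hypothetical worst-group accuracies: one value per method,
# with methods listed in the same order for every benchmark.
wga = {
    "bench_a": [0.70, 0.75, 0.80, 0.85],
    "bench_b": [0.60, 0.66, 0.69, 0.77],
    "bench_c": [0.90, 0.85, 0.88, 0.80],
}

# Hypothetical task difficulty due to spurious correlation (K).
k = {"bench_a": 1.0, "bench_b": 1.5, "bench_c": 6.0}


def agreement_vs_k_gap(target, wga, k):
    """Pair |K_target - K_other| with Pearson's r for each other benchmark."""
    gaps, rs = [], []
    for other in wga:
        if other == target:
            continue
        gaps.append(abs(k[target] - k[other]))  # assumed: absolute difference
        rs.append(pearson_r(wga[target], wga[other]))
    return gaps, rs


# SD of worst-group accuracies over methods, per benchmark (cf. Figure 2b).
method_sd = {name: stdev(accs) for name, accs in wga.items()}

# Slope of agreement against difference in K for one benchmark (cf. Figures 4 and 5a).
gaps, rs = agreement_vs_k_gap("bench_a", wga, k)
slope = ols_slope(gaps, rs)
```

With these placeholder numbers the slope for `bench_a` is negative, which is the pattern the Figure 4 caption describes for a valid benchmark: agreement falls as the difference in $K$ grows.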