An Investigation of Why Overparameterization Exacerbates Spurious Correlations

Shiori Sagawa, Aditi Raghunathan, Pang Wei Koh, Percy Liang

TL;DR

This paper investigates why overparameterization can worsen worst-group generalization when data contain spurious correlations, despite improving average test error. It combines empirical studies on CelebA and Waterbirds with synthetic simulations to identify two key data properties that modulate this effect: the majority/minority group proportion $p_\mathsf{maj}$ and the spurious-to-core information ratio $r_\mathsf{s:c}$. A theoretical analysis in an explicit-memorization linear setting shows that the minimum-norm inductive bias in overparameterized regimes can lead models to rely on the spurious feature and memorize the minority points, increasing worst-group error; underparameterized models that rely on core features tend to generalize better across groups. The paper also demonstrates that subsampling the majority group can counterintuitively reduce worst-group error in the overparameterized regime, sometimes matching or beating reweighted underparameterized baselines, even though upweighting the minority fails in that regime. Together, these results highlight a tension between using overparameterized models, which helps average performance, and using all the training data when the goal is low error on underrepresented groups.

Abstract

We study why overparameterization -- increasing model size well beyond the point of zero training error -- can hurt test error on minority groups despite improving average test error when there are spurious correlations in the data. Through simulations and experiments on two image datasets, we identify two key properties of the training data that drive this behavior: the proportions of majority versus minority groups, and the signal-to-noise ratio of the spurious correlations. We then analyze a linear setting and theoretically show how the inductive bias of models towards "memorizing" fewer examples can cause overparameterization to hurt. Our analysis leads to a counterintuitive approach of subsampling the majority group, which empirically achieves low minority error in the overparameterized regime, even though the standard approach of upweighting the minority fails. Overall, our results suggest a tension between using overparameterized models versus using all the training data for achieving low worst-group error.
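To make the contrast between the two interventions in the abstract concrete, here is a minimal sketch in plain Python. It assumes that upweighting means weighting each example by the inverse of its group's empirical frequency, and that subsampling means keeping as many examples from every group as the smallest group has; both choices, the function names `group_weights` and `subsample_to_smallest_group`, and the toy group labels are illustrative assumptions rather than the authors' code.

```python
import random
from collections import Counter, defaultdict


def group_weights(groups):
    """Per-example weights equal to 1 / (empirical group frequency).

    Assumption: this is one common way to upweight minority groups.
    """
    counts = Counter(groups)
    n = len(groups)
    return [n / counts[g] for g in groups]


def subsample_to_smallest_group(groups, seed=0):
    """Return sorted indices that keep an equal number of examples per group.

    Assumption: every group is cut down to the size of the smallest group.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    k = min(len(indices) for indices in by_group.values())
    kept = []
    for indices in by_group.values():
        kept.extend(rng.sample(indices, k))
    return sorted(kept)


# Hypothetical toy data: 95 majority examples and 5 minority examples.
groups = ["maj"] * 95 + ["min"] * 5
weights = group_weights(groups)              # all 100 examples kept, minority upweighted
kept = subsample_to_smallest_group(groups)   # only 10 examples kept, groups balanced
```

Upweighting keeps every training point and changes only the loss, whereas subsampling discards most of the majority group, which is why the abstract frames the choice as a tension between model size and using all the training data.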

Paper Structure

This paper contains 44 sections, 29 theorems, 110 equations, 14 figures.

Key Result

Theorem 1

For any $p_\mathsf{maj} \geq \bigl(1-\frac{1}{2001}\bigr)$, $\sigma_\mathsf{core}^2 \geq 1$, $\sigma_\mathsf{spu}^2 \leq \frac{1}{16 \log 100 n_\mathsf{maj}}$, $\sigma_\mathsf{noise}^2 \leq \frac{n_\mathsf{maj}}{600^2}$ and $n_\mathsf{min} \geq 100$, there exists $N_0$ such that for all $N > N_0$ (overparameterized regime) [...], where ${\hat{w}^\mathsf{mm}}$ is the max-margin classifier. However, for $N=0$ (underparameterized regime) [...]
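The preconditions of Theorem 1 are easier to read with numbers plugged in. The sketch below checks them for a given configuration. It rests on three assumptions that the statement above does not spell out: that $p_\mathsf{maj} = n_\mathsf{maj} / (n_\mathsf{maj} + n_\mathsf{min})$, that $\log$ is the natural logarithm, and that its argument is the product $100\, n_\mathsf{maj}$. The function name `theorem1_conditions` and the example values are hypothetical.

```python
import math


def theorem1_conditions(n_maj, n_min, var_core, var_spu, var_noise):
    """Check each precondition of Theorem 1 and return a dict of booleans.

    Assumptions (not stated in the theorem text above):
      * p_maj = n_maj / (n_maj + n_min)
      * log is the natural logarithm, applied to 100 * n_maj
    """
    return {
        # p_maj >= 1 - 1/2001, written with integers to avoid rounding issues
        "p_maj": 2001 * n_maj >= 2000 * (n_maj + n_min),
        "var_core": var_core >= 1,
        "var_spu": var_spu <= 1 / (16 * math.log(100 * n_maj)),
        "var_noise": var_noise <= n_maj / 600**2,
        "n_min": n_min >= 100,
    }


# Hypothetical configuration that sits exactly on the p_maj boundary.
satisfied = theorem1_conditions(
    n_maj=200_000, n_min=100, var_core=1.0, var_spu=0.003, var_noise=0.5
)

# Same variances, but the majority fraction is too small.
violated = theorem1_conditions(
    n_maj=1_000, n_min=100, var_core=1.0, var_spu=0.003, var_noise=0.5
)
```

Under these assumptions the $p_\mathsf{maj}$ condition is equivalent to $n_\mathsf{maj} \geq 2000\, n_\mathsf{min}$, so with $n_\mathsf{min} \geq 100$ the theorem applies only when there are at least 200,000 majority points.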

Figures (14)

  • Figure 1: Top: Overparameterization hurts test error on the worst group when models are trained with the reweighted objective that upweights minority groups (Equation \ref{eqn:reweight}). Without reweighting, models have poor worst-group error regardless of model size (Appendix \ref{sec:appendix_erm}). Bottom: Consider data points $(x, y)$, where $x \in \mathbb{R}^2$ comprises a core feature $x_\mathsf{core}$ (x-axis) and a spurious feature $x_\mathsf{spu}$ (y-axis). The label $y$ is highly correlated with $x_\mathsf{spu}$, except on two minority groups (crosses). Underparameterized models use the core feature (left), but overparameterized models use the spurious feature and memorize the minority points (right).
  • Figure 2: We consider two image datasets, CelebA and Waterbirds, where the label $y$ is correlated with a spurious attribute $a$ in a majority of the training data. The % beside each group shows its frequency in the training data. To measure how robust a model is to the spurious attribute, we divide the data into groups based on $(y, a)$ and record the highest error incurred by a group. Figure adapted from sagawa2020group.
  • Figure 3: Increasing overparameterization (i.e., increasing model size) hurts the worst-group test error even though it improves the average test error. Here, we show results for models trained on the reweighted objective for CelebA (left) and Waterbirds (right).
  • Figure 4: Overparameterization hurts worst-group test error but improves average test error on synthetic data, reproducing the trends we observe in real data.
  • Figure 5: Overparameterized models have poor worst-group performance on the synthetic data because they rely on spurious features. Left: removing the spurious feature (green) eliminates the detrimental effect of overparameterization. Right: overparameterized models do well on the majority groups where the spurious features match the label, but poorly on the minority groups.
  • ...and 9 more figures
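
The robustness measure described in the Figure 2 caption (divide the data into groups based on $(y, a)$ and record the highest error incurred by a group) can be sketched as follows. The function names and the toy data are hypothetical; the toy "model" simply predicts the spurious attribute, which is an assumption chosen to show how a low average error can hide a high worst-group error.

```python
from collections import defaultdict


def group_errors(labels, attributes, predictions):
    """Return the error rate of each (label, attribute) group."""
    totals = defaultdict(int)
    mistakes = defaultdict(int)
    for y, a, y_hat in zip(labels, attributes, predictions):
        totals[(y, a)] += 1
        mistakes[(y, a)] += int(y_hat != y)
    return {group: mistakes[group] / totals[group] for group in totals}


def worst_group_error(labels, attributes, predictions):
    """Return the highest error rate incurred by any (label, attribute) group."""
    return max(group_errors(labels, attributes, predictions).values())


# Hypothetical toy data: the attribute matches the label on 8 of 10 points.
labels = [1, 1, 1, 1, -1, -1, -1, -1, 1, -1]
attributes = [1, 1, 1, 1, -1, -1, -1, -1, -1, 1]

# Hypothetical model that ignores everything except the spurious attribute.
predictions = list(attributes)

errors = group_errors(labels, attributes, predictions)
average_error = sum(p != y for p, y in zip(predictions, labels)) / len(labels)
worst_error = worst_group_error(labels, attributes, predictions)
```

On this toy data the average error is 0.2 while the worst-group error is 1.0, because both minority groups are misclassified entirely.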

Theorems & Definitions (52)

  • Theorem 1
  • Definition 1: $\gamma$-memorization
  • Proposition 1: Norm of models using the spurious feature
  • proof : Proof sketch
  • Proposition 2: Norm of models using the core feature
  • proof : Proof sketch
  • Definition 2: $\gamma$-memorization
  • Theorem 2
  • proof
  • Definition 3
  • ...and 42 more