Does Weak-to-strong Generalization Happen under Spurious Correlations?

Chenruo Liu, Yijun Dong, Qi Lei

Abstract

We initiate a unified theoretical and algorithmic study of a key problem in weak-to-strong (W2S) generalization: when fine-tuning a strong pre-trained student with pseudolabels from a weaker teacher on a downstream task with spurious correlations, does W2S happen, and how can it be improved when it fails? We consider two sources of spurious correlations caused by group imbalance: (i) a weak teacher fine-tuned on group-imbalanced labeled data with a minority group of fraction $\eta_\ell$, and (ii) a group-imbalanced unlabeled set pseudolabeled by the teacher with a minority group of fraction $\eta_u$. Theoretically, a precise characterization of W2S gain at the proportional asymptotic limit shows that W2S always happens with sufficient pseudolabels when $\eta_u = \eta_\ell$ but may fail when $\eta_u \ne \eta_\ell$, where W2S gain diminishes as $(\eta_u - \eta_\ell)^2$ increases. Our theory is corroborated by extensive experiments on various spurious correlation benchmarks and teacher-student pairs. To boost W2S performance when it fails, we further propose a simple, effective algorithmic remedy that retrains the strong student on its high-confidence data subset after W2S fine-tuning. Our algorithm is group-label-free and achieves consistent, substantial improvements over vanilla W2S fine-tuning.
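
The remedy described in the abstract, retraining the student on its high-confidence subset, depends on a selection step that can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how confidence is measured or how many samples are kept, so the maximum predicted class probability, the `keep_frac` parameter, and the function name `select_high_confidence` are all assumptions made here. Using the student's own predictions as labels on the selected subset is likewise an assumption.

```python
import numpy as np

def select_high_confidence(probs, keep_frac=0.5):
    """Return sorted indices of the most confident predictions.

    probs: (n, k) array of the student's predicted class probabilities.
    keep_frac: hypothetical fraction of samples to keep (not from the paper).
    """
    probs = np.asarray(probs, dtype=float)
    if not 0.0 < keep_frac <= 1.0:
        raise ValueError("keep_frac must be in (0, 1]")
    # Assumed confidence score: largest predicted class probability.
    confidence = probs.max(axis=1)
    n_keep = max(1, int(np.ceil(keep_frac * len(probs))))
    # Stable sort so that ties are resolved by original sample order.
    order = np.argsort(-confidence, kind="stable")
    return np.sort(order[:n_keep])

# Toy student outputs on four unlabeled samples (illustrative values only).
probs = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.20, 0.80],
                  [0.50, 0.50]])
subset = select_high_confidence(probs, keep_frac=0.5)
# Assumed labels for the retraining step: the student's own predictions.
labels = probs[subset].argmax(axis=1)
```

No group labels enter the selection, which is consistent with the abstract's description of the algorithm as group-label-free. The retraining step itself is omitted because the abstract does not specify it.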

Paper Structure

This paper contains 47 sections, 8 theorems, 85 equations, 7 figures, 10 tables.

Key Result

Theorem 1

Under asm:high_dim_asymp_regime, eq:sft satisfies

Figures (7)

  • Figure 1: Visualization of the theoretical setup in Definitions 1 and 2 through Example 1.
  • Figure 2: W2S gains across different combinations of $\eta_\ell$ and $\eta_t$. Each panel shows theoretical (solid lines) and empirical (circles) results for W2S gain as a function of $\eta_u$, across different $\nu_z$ values. Here we fix $\boldsymbol{\mu}_T$, $\boldsymbol{\mu}_S$, $\boldsymbol{\Xi}$, and $d_z$ with $\|\boldsymbol{\mu}_T\|^2_2=10.0$, $\|\boldsymbol{\mu}_S\|^2_2=0.1$, $\|\boldsymbol{\Xi}\|_F^2=0.1p_S$. Vertical dashed lines indicate the theoretical optimal $\eta_u^\star$ values that maximize W2S gain.
  • Figure 3: Impact of $\boldsymbol{\mu}_S$ and $\boldsymbol{\Xi}$ on W2S gain. Both panels show theoretical (solid lines) and empirical (circles) results for W2S gain as a function of $\eta_u$. Fixed parameters: $\eta_\ell = 0.1$, $\eta_t = 0.5$, $\nu_z = 0.04$, $\|\boldsymbol{\mu}_T\|^2_2=10.0$. Left: varying $\|\boldsymbol{\mu}_S\|_2^2$ with fixed $\|\boldsymbol{\Xi}\|_F^2=0.1p_S$. Right: varying $\|\boldsymbol{\Xi}\|_F^2$ with fixed $\|\boldsymbol{\mu}_S\|_2^2=0.1$. Dashed lines indicate the theoretical optimal $\eta_u^\star$ values that maximize W2S gain.
  • Figure 4: Average W2S gain across all teacher-student pairs as a function of $\eta_u$ on all four datasets. Top row: average accuracy; bottom row: worst group accuracy. Left column fixes $\eta_\ell=0.5$; right column fixes $\eta_\ell=\eta_o$. For $\eta_\ell = 0.5$, curves are plotted over a shared $\eta_u$ interval aligned across datasets (bounded by minority group sample availability) to enable direct comparability. For $\eta_\ell=\eta_o$, each dataset is plotted from its own $\eta_o$ (0.05, 0.005, 0.05, and 0 for Waterbirds, BFFHQ, BG-COCO, and ImageNet-9, respectively) up to $0.5$. ImageNet-9 does not have a clearly defined worst group and is therefore omitted from the bottom panels.
  • Figure 5: Comparison of Enhanced-W2S and original W2S for the (Clipb32, DINOv2) pair on BG-COCO, Waterbirds, and BFFHQ. Top row: worst group accuracy with $\eta_\ell=\eta_o,\ \eta_u=0.5$ (fixed $N$, varying $n$). Bottom row: worst group accuracy with $\eta_\ell=0.5,\ \eta_u=\eta_o$ (fixed $n$, varying $N$).
  • ...and 2 more figures
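
The experiments summarized in the captions above vary the minority fraction $\eta_u$ of the unlabeled set. One way such a set could be built from a pool with known group membership is sketched below. This is a hypothetical illustration: the function name, its arguments, and the sampling procedure are assumptions, since the page does not describe how the authors construct their splits. Group membership is used here only to build the benchmark split, not by any training algorithm.

```python
import numpy as np

def subsample_with_minority_fraction(is_minority, eta, n, seed=0):
    """Draw n indices so that round(eta * n) of them are minority samples.

    is_minority: boolean array marking minority-group samples in the pool.
    eta: target minority fraction (plays the role of eta_u).
    n: size of the subsample to draw.
    """
    is_minority = np.asarray(is_minority, dtype=bool)
    rng = np.random.default_rng(seed)
    n_min = int(round(eta * n))
    n_maj = n - n_min
    min_idx = np.flatnonzero(is_minority)
    maj_idx = np.flatnonzero(~is_minority)
    # The reachable range of eta is bounded by sample availability.
    if n_min > len(min_idx) or n_maj > len(maj_idx):
        raise ValueError("not enough samples in one group for this eta and n")
    chosen = np.concatenate([
        rng.choice(min_idx, size=n_min, replace=False),
        rng.choice(maj_idx, size=n_maj, replace=False),
    ])
    return np.sort(chosen)

# Toy pool: 10 minority samples followed by 30 majority samples.
pool = np.array([True] * 10 + [False] * 30)
idx = subsample_with_minority_fraction(pool, eta=0.25, n=8)
```

The availability check mirrors the constraint noted in the Figure 4 caption, where the plotted $\eta_u$ interval is bounded by how many minority samples each dataset has.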

Theorems & Definitions (20)

  • Definition 1: Regression under spurious correlations
  • Definition 2: Weak vs. strong models
  • Example 1
  • Definition 3: Teacher-student similarity
  • Remark 1: Why ridgeless regression provides sufficient regularization?
  • Theorem 1: SFT of weak teacher (apx:pf_thm_sft_weak)
  • Theorem 2: W2S, formally in thm:w2s_strong_ridgeless_formal
  • Remark 2: Does W2S happen under spurious correlations?
  • Lemma 1: Population SFT of weak teacher
  • Proof: Proof of Lemma 1
  • ...and 10 more