Weak-to-Strong Generalization Even in Random Feature Networks, Provably

Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, Nathan Srebro

TL;DR

The paper proves that weak-to-strong generalization does not require complex, highly parameterized learners: it can occur in two-layer random-feature (RF) networks, where a large student trained with early stopping on teacher-provided labels significantly outperforms a smaller teacher. It provides upper bounds on the student loss showing multiplicative improvements and, under Gaussian universality, polynomial improvements (e.g., $\mathcal{L}_{\mathrm{ST}} = \tilde{O}(\mathcal{L}_{\mathrm{TE}}^{1.49})$ for ReLU RFs and $\mathcal{L}_{\mathrm{ST}} = \tilde{O}(\mathcal{L}_{\mathrm{TE}}^2)$ for linear RFs), while also establishing quadratic lower bounds that cap how much improvement is possible. The analysis uses a gradient-flow framework, kernel (neural tangent) perspectives, and a novel teacher-student feature-alignment quantity $\boldsymbol{\kappa}_S$ to quantify misalignment and its impact on generalization. A key takeaway is that early stopping enables denoising of the teacher's noise directions, facilitating strong improvements, but there are fundamental limits: in RF models, a bound of the form $\mathcal{L}_{\mathrm{ST}} \leq c \times \mathcal{L}_{\mathrm{TE}}^\beta$ cannot hold with $\beta>2$. The work thus clarifies the mechanism and boundaries of weak-to-strong generalization (W2S), offering theoretical grounding for observed phenomena in larger-scale systems and guiding future exploration of inductive biases and stopping criteria.

Abstract

Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider student and teacher that are random feature models, described by two-layer networks with a random and fixed bottom layer and a trained top layer. A "weak" teacher, with a small number of units (i.e. random features), is trained on the population, and a "strong" student, with a much larger number of units (i.e. random features), is trained only on labels generated by the weak teacher. We demonstrate, prove, and understand how the student can outperform the teacher, even though trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
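The setup described in the abstract can be sketched numerically. The following is a minimal illustration rather than the authors' simulation code: the function names, the sizes, the inputs drawn uniformly from the unit sphere, the linear target, and the use of a finite sample as a stand-in for the population are all assumptions made here. The student is trained by gradient flow from zero initialization on the teacher's labels, using the closed-form solution of gradient flow for a linear model under squared loss, and its loss against the true target is recorded along the training path.

```python
import numpy as np

def relu_features(X, W):
    # Two-layer random feature map: fixed random bottom layer W, ReLU units.
    return np.maximum(X @ W.T, 0.0) / np.sqrt(W.shape[0])

def sphere(rng, n, d):
    # n points drawn uniformly from the unit sphere in R^d.
    Z = rng.standard_normal((n, d))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def simulate_w2s(d=8, m_te=16, m_st=512, n=4000, seed=0):
    rng = np.random.default_rng(seed)
    X, X_test = sphere(rng, n, d), sphere(rng, n, d)
    beta = sphere(rng, 1, d)[0]           # hypothetical linear target
    y, y_test = X @ beta, X_test @ beta

    # Weak teacher: few random features, top layer fit to the true labels.
    # (Assumption: a large sample stands in for the population.)
    W_te = sphere(rng, m_te, d)
    a_te, *_ = np.linalg.lstsq(relu_features(X, W_te), y, rcond=None)

    def teacher(A):
        return relu_features(A, W_te) @ a_te

    loss_te = np.mean((teacher(X_test) - y_test) ** 2)

    # Strong student: many random features, sees only teacher labels.
    W_st = sphere(rng, m_st, d)
    Phi, Phi_test = relu_features(X, W_st), relu_features(X_test, W_st)
    lam, V = np.linalg.eigh(Phi.T @ Phi / n)
    lam = np.clip(lam, 0.0, None)         # remove tiny negative round-off
    b = V.T @ (Phi.T @ teacher(X) / n)

    times = np.logspace(-1, 6, 40)
    losses = []
    for t in times:
        # Gradient flow from w(0) = 0: each eigendirection is scaled by
        # (1 - exp(-t * lam)) / lam, with limit t as lam -> 0.
        g = np.where(lam > 1e-12,
                     -np.expm1(-t * lam) / np.maximum(lam, 1e-12), t)
        pred = Phi_test @ (V @ (g * b))
        losses.append(np.mean((pred - y_test) ** 2))  # loss vs. true target
    return loss_te, times, np.array(losses)
```

Calling `loss_te, times, losses = simulate_w2s()` and comparing `losses.min()` with `loss_te` gives a rough check of whether early stopping helps in a given configuration; the stopping time that minimizes the student loss is `times[losses.argmin()]`.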

Paper Structure

This paper contains 52 sections, 53 theorems, 179 equations, 6 figures.

Key Result

Theorem 3.1

Consider the ReLU \ref{ass:relu} and the weak-to-strong setup of Section \ref{sec:setup}, for any dimension $d$ and teacher size $M_{\mathrm{TE}}$. Consider a target $f^*$ that is an even polynomial (or, more generally, a sum of a linear function and an even polynomial) of degree at most $k$ normalized s.t. … and so ${\mathcal{L}_{\mathrm{ST}}}/\mathcal{L}_{\mathrm{TE}}\to 0$ and $\mathrm{PGR}\to 1$ as $M_{\ldots}$ …
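The step from a vanishing loss ratio to $\mathrm{PGR}\to 1$ can be spelled out as follows. This sketch assumes a loss-based "performance gap recovered" definition in the style of Burns et al. (2024); the ceiling loss $\mathcal{L}_{\mathrm{ST}}^{\mathrm{ceil}}$ (the student trained directly on true labels) is notation introduced here for illustration and may differ from the paper's.

```latex
\[
\mathrm{PGR}
  = \frac{\mathcal{L}_{\mathrm{TE}} - \mathcal{L}_{\mathrm{ST}}}
         {\mathcal{L}_{\mathrm{TE}} - \mathcal{L}_{\mathrm{ST}}^{\mathrm{ceil}}}
  = \frac{1 - \mathcal{L}_{\mathrm{ST}}/\mathcal{L}_{\mathrm{TE}}}
         {1 - \mathcal{L}_{\mathrm{ST}}^{\mathrm{ceil}}/\mathcal{L}_{\mathrm{TE}}}
\]
```

Under the additional assumption $0 \le \mathcal{L}_{\mathrm{ST}}^{\mathrm{ceil}} \le \mathcal{L}_{\mathrm{ST}}$, the condition ${\mathcal{L}_{\mathrm{ST}}}/\mathcal{L}_{\mathrm{TE}}\to 0$ makes both ratios vanish, so the numerator and denominator both tend to $1$ and $\mathrm{PGR}\to 1$.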

Figures (6)

  • Figure 1: Illustration of our setup for weak-to-strong generalization. The teacher model of smaller size is trained on ground truth labels, while the student model of larger size is trained with early stopping on labels produced by the teacher model.
  • Figure 2: Weak-to-strong generalization happens in ReLU random feature networks (\ref{ass:relu}) with input dimension $d=32$, student size $M_{\mathrm{ST}}=16384$, and teacher size $M_{\mathrm{TE}}\in\{16, \ldots, 256\}$. We consider a linear target function $f^*({\bm{x}})= \langle \beta, {\bm{x}}\rangle$ for some unit-norm $\beta$. \ref{fig:exp:2layer-loss-ratio} plots the ratio between the student loss ${\mathcal{L}_{\mathrm{ST}}}$ and the teacher loss $\mathcal{L}_{\mathrm{TE}}$, with varying teacher size $M_{\mathrm{TE}}$ and gradient flow training time $t$. With an appropriate stopping time, we see a significant weak-to-strong generalization gain. This gain diminishes with overtraining: when gradient flow is run to convergence, the student mimics the teacher, has the same error, and does not exhibit weak-to-strong generalization. In \ref{fig:exp:2layer-loss-fitting}, we fit the minimal student loss ${\mathcal{L}_{\mathrm{ST}}}$ (at the optimal stopping time for each teacher size) as a power law function of the teacher loss $\mathcal{L}_{\mathrm{TE}}$, confirming Theorem \ref{thrm:main:2layerrelu}. See \ref{sec:exp} for simulation details.
  • Figure 3: Weak-to-strong generalization happens in random linear feature networks (\ref{ass:linearnetwork}). Here we use an input distribution as in \ref{thrm:main:diagfeatcov}, with $k=1$ and a target function $f^*=\langle e_1, {\bm{x}}\rangle$, where $e_1$ is the first standard basis vector. \ref{fig:exp:linear-loss-ratio} plots the ratio between the student loss ${\mathcal{L}_{\mathrm{ST}}}$ and the squared teacher loss $\mathcal{L}_{\mathrm{TE}}^2$ as a function of the gradient flow time $t$, with varying teacher size $M_{\mathrm{TE}}$ and with the dimensionality $d=M_{\mathrm{TE}}^{3/2}$ set as in the scaling of \ref{thrm:main:diagfeatcov}. With a proper early stopping time, ${\mathcal{L}_{\mathrm{ST}}}/\mathcal{L}_{\mathrm{TE}}^2$ converges to approximately $1$ as $M_{\mathrm{TE}}$ grows, confirming that for large $M_{\mathrm{TE}}$ we have ${\mathcal{L}_{\mathrm{ST}}} \propto \mathcal{L}_{\mathrm{TE}}^2$ as in \ref{thrm:main:diagfeatcov}. This is also confirmed in \ref{fig:exp:linear-loss-fitting}, where we fit the student loss ${\mathcal{L}_{\mathrm{ST}}}$ as a power law function of the teacher loss $\mathcal{L}_{\mathrm{TE}}$ and recover an excellent fit with an exponent very close to $2$. We again see that overtraining diminishes weak-to-strong generalization. See \ref{sec:exp} for simulation details.
  • Figure 4: Weak-to-strong generalization happens in 2-layer ReLU networks with input dimension $d=16,32,64$, student size $M_{\mathrm{ST}}=16384$, and teacher size $M_{\mathrm{TE}}\in\{16, \ldots, 256\}$. We consider a linear target function, i.e., $f^*= \langle \beta, {\bm{x}}\rangle$ for some $\beta$ of unit norm. The top figures plot the ratio between the student loss ${\mathcal{L}_{\mathrm{ST}}}$ and the teacher loss $\mathcal{L}_{\mathrm{TE}}$, with varying $M_{\mathrm{TE}}$ and gradient flow training time $t$. In the bottom figures, we fit the student loss ${\mathcal{L}_{\mathrm{ST}}}$ as a power law function of $\mathcal{L}_{\mathrm{TE}}$. The empirical observations align with \ref{thrm:main:2layerrelu}.
  • Figure 5: Ablation of weak-to-strong generalization in 2-layer ReLU networks. We use the same setting as \ref{fig:exp:2layer-full} and compare the results with smaller student size $M_{\mathrm{ST}}$.
  • ...and 1 more figure
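The power-law fits mentioned in the captions of Figures 2 to 4 can be illustrated with a short sketch. It assumes an ordinary least-squares fit in log-log space, which is one common way to fit a power law and is not confirmed to be the authors' procedure; the function name and the synthetic loss values are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def fit_power_law(loss_te, loss_st):
    """Fit loss_st ~ c * loss_te**alpha by least squares in log-log space."""
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    alpha, log_c = np.polyfit(np.log(loss_te), np.log(loss_st), deg=1)
    return alpha, np.exp(log_c)

# Synthetic (hypothetical) losses that follow an exact quadratic law.
loss_te = np.array([0.2, 0.1, 0.05, 0.025])
loss_st = 3.0 * loss_te ** 2

alpha, c = fit_power_law(loss_te, loss_st)
print(f"fitted exponent: {alpha:.3f}, fitted constant: {c:.3f}")
```

On real simulation output, `loss_st` would be the minimal student loss over stopping times for each teacher size, paired with the corresponding teacher loss.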

Theorems & Definitions (97)

  • Theorem 3.1: Weak-to-Strong generalization with $2$-layer ReLU Network
  • Theorem 3.2: Weak-to-Strong Generalization with Linear Network
  • Theorem 3.3: $\Theta(1)$-error asymptotics for Linear Networks
  • Definition 1: Shrinking Optimality
  • Lemma 1: Shrinkage Optimality of Gradient Flow Solutions
  • Theorem 4.1: General Limitation of Weak-to-Strong Generalization
  • Corollary 1: Limit of Random Feature Weak to Strong Generalization
  • Theorem 4.2: Limitation of Weak-to-Strong Generalization with a Bounded Student
  • Corollary 2: Limitation of Weak-to-Strong Generalization with Bootstrapping
  • Lemma 2
  • ...and 87 more