Weak-to-Strong Generalization Even in Random Feature Networks, Provably
Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, Nathan Srebro
TL;DR
The paper proves that weak-to-strong (W2S) generalization does not require complex, highly parameterized learners: it can occur in two-layer random feature (RF) networks, where a large student trained with early stopping on teacher-provided labels significantly outperforms a smaller teacher. It gives upper bounds on the student error ${\rm L}_{ST}$ in terms of the teacher error ${\rm L}_{TE}$, showing multiplicative improvements and, under Gaussian universality, polynomial improvements (e.g., ${\rm L}_{ST} = \tilde{O}({\rm L}_{TE}^{1.49})$ for ReLU RFs and ${\rm L}_{ST} = \tilde{O}({\rm L}_{TE}^2)$ for linear RFs). It also establishes quadratic lower bounds that cap how much improvement is possible. The analysis uses a gradient-flow framework, kernel (neural tangent) perspectives, and a novel teacher-student feature-alignment quantity $\boldsymbol{\kappa}_S$ to quantify misalignment and its impact on generalization. A key takeaway is that early stopping lets the student denoise the teacher's noise directions, which enables strong improvements, but there are fundamental limits: in RF models, no bound of the form ${\rm L}_{ST} \leq c \times {\rm L}_{TE}^\beta$ with $\beta > 2$ can hold. The work thus clarifies the mechanism and boundaries of W2S, offering theoretical grounding for phenomena observed in larger-scale systems and guiding future exploration of inductive biases and stopping criteria.
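To make the exponents above concrete, the following is a purely illustrative calculation. It assumes a hypothetical teacher error of $10^{-2}$ and ignores the constants and logarithmic factors hidden in $\tilde{O}(\cdot)$, so it indicates orders of magnitude only, not values reported in the paper.

$$
{\rm L}_{TE} = 10^{-2}
\quad\Longrightarrow\quad
{\rm L}_{TE}^{1.49} = 10^{-2.98} \approx 1.05 \times 10^{-3},
\qquad
{\rm L}_{TE}^{2} = 10^{-4}.
$$

Under these assumptions, the ReLU bound corresponds to roughly a tenfold reduction in error and the linear bound to a hundredfold reduction. The quadratic lower bound means the student error cannot fall below the order of ${\rm L}_{TE}^2$, here $10^{-4}$.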
Abstract
Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider a student and a teacher that are random feature models, described by two-layer networks with a random and fixed bottom layer and a trained top layer. A "weak" teacher, with a small number of units (i.e., random features), is trained on the population, and a "strong" student, with a much larger number of units, is trained only on labels generated by the weak teacher. We demonstrate, prove, and explain how the student can outperform the teacher, even though it is trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
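The setup described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' experimental code. The target `f*(x) = x_1`, the input distribution (uniform on the unit sphere), the sizes, and the function names `relu_features` and `run_w2s` are all hypothetical choices. A large finite sample stands in for the population, and plain gradient descent stands in for gradient flow.

```python
import numpy as np

def relu_features(X, U):
    """Random-feature map ReLU(X @ U.T), scaled by 1/sqrt(number of units)."""
    return np.maximum(X @ U.T, 0.0) / np.sqrt(U.shape[0])

def run_w2s(d=8, m_teacher=16, m_student=512, n=4000, steps=300, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # inputs on the unit sphere
    y_true = X[:, 0]                               # assumed target f*(x) = x_1

    # Weak teacher: few random units, top layer fit by least squares on true
    # labels (the large sample is a proxy for training on the population).
    U_te = rng.standard_normal((m_teacher, d))
    Phi_te = relu_features(X, U_te)
    w_te, *_ = np.linalg.lstsq(Phi_te, y_true, rcond=None)
    y_teacher = Phi_te @ w_te
    loss_teacher = np.mean((y_teacher - y_true) ** 2)

    # Strong student: many random units, top layer trained by gradient descent
    # on the teacher's labels only, starting from zero weights.
    U_st = rng.standard_normal((m_student, d))
    Phi_st = relu_features(X, U_st)
    hessian = Phi_st.T @ Phi_st / n
    lr = 1.0 / np.linalg.eigvalsh(hessian)[-1]     # step size 1 / lambda_max
    w_st = np.zeros(m_student)
    fit_to_teacher, loss_student = [], []
    for _ in range(steps):
        pred = Phi_st @ w_st
        fit_to_teacher.append(np.mean((pred - y_teacher) ** 2))  # training loss
        loss_student.append(np.mean((pred - y_true) ** 2))       # true error
        w_st -= lr * (Phi_st.T @ (pred - y_teacher)) / n
    return loss_teacher, np.array(fit_to_teacher), np.array(loss_student)

if __name__ == "__main__":
    loss_teacher, fit_to_teacher, loss_student = run_w2s()
    best = int(np.argmin(loss_student))  # oracle stopping time, uses true labels
    print("teacher error:", loss_teacher)
    print("best student error:", loss_student[best], "at step", best)
    print("student error at last step:", loss_student[-1])
```

Comparing `loss_student.min()` with `loss_teacher` shows whether, for the chosen sizes and seed, some stopping time yields a student that beats its teacher; the sketch does not guarantee this for arbitrary settings. The stopping time here is picked with access to the true labels, which is an oracle choice made only for illustration.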
