
Quantifying the Gain in Weak-to-Strong Generalization

Moses Charikar, Chirag Pabbaraju, Kirankumar Shiragur

TL;DR

This work presents a theoretical framework for understanding weak-to-strong generalization and shows that the improvement in performance achieved by strong models over their weaker counterparts is quantified by the misfit error incurred by the strong model on labels generated by the weaker model.

Abstract

Recent advances in large language models have shown capabilities that are extraordinary and near-superhuman. These models operate with such complexity that reliably evaluating and aligning them proves challenging for humans. This leads to the natural question: can guidance from weak models (like humans) adequately direct the capabilities of strong models? In a recent and somewhat surprising work, Burns et al. (2023) empirically demonstrated that when strong models (like GPT-4) are finetuned using labels generated by weak supervisors (like GPT-2), the strong models outperform their weaker counterparts -- a phenomenon they term weak-to-strong generalization. In this work, we present a theoretical framework for understanding weak-to-strong generalization. Specifically, we show that the improvement in performance achieved by strong models over their weaker counterparts is quantified by the misfit error incurred by the strong model on labels generated by the weaker model. Our theory reveals several curious algorithmic insights. For instance, we can predict the amount by which the strong model will improve over the weak model, and also choose among different weak models to train the strong model, based on its misfit error. We validate our theoretical findings through various empirical assessments.
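As a rough numerical illustration of the claim above, the following Python sketch builds a toy linear regression setting and checks that the strong model's gain over its weak supervisor matches the misfit. Everything in it is an assumption made for illustration and is not taken from the paper: the random `tanh` features standing in for the weak representation, the raw inputs standing in for the strong representation, the use of mean squared error for all three quantities, and the helper names `fit_linear`, `mse`, and `weak_to_strong_demo`.

```python
import numpy as np


def fit_linear(features, targets):
    """Return least-squares coefficients for targets ~ features @ coef."""
    coef, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return coef


def mse(a, b):
    """Mean squared difference between two prediction vectors."""
    return float(np.mean((a - b) ** 2))


def weak_to_strong_demo(n=2000, d=16, d_weak=8, seed=0):
    """Toy realizable setting (all modelling choices here are hypothetical)."""
    rng = np.random.default_rng(seed)
    w_weak = rng.normal(size=(d, d_weak))  # random map defining the weak features
    w_true = rng.normal(size=d)            # target is linear in the strong features

    # Step 1: train the weak model on true labels from one sample.
    x_train = rng.normal(size=(n, d))
    f_weak = fit_linear(np.tanh(x_train @ w_weak), x_train @ w_true)

    # Step 2: let the weak model label a fresh sample.
    x = rng.normal(size=(n, d))
    weak_pred = np.tanh(x @ w_weak) @ f_weak

    # Step 3: fit the strong model (linear in the raw inputs) to weak labels only.
    f_sw = fit_linear(x, weak_pred)
    strong_pred = x @ f_sw

    truth = x @ w_true
    return {
        "weak_err": mse(weak_pred, truth),       # weak model vs. ground truth
        "strong_err": mse(strong_pred, truth),   # weakly supervised strong model vs. ground truth
        "misfit": mse(strong_pred, weak_pred),   # strong model vs. its weak labels
    }


if __name__ == "__main__":
    result = weak_to_strong_demo()
    print(result)
    print("weak_err - misfit =", result["weak_err"] - result["misfit"])
```

In this toy setting the strong model class is a linear subspace that contains the target, so the gap between weak and strong error equals the misfit up to numerical precision. For a general convex class, the projection picture in Figure 1 suggests only that the strong error is at most the weak error minus the misfit.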

Paper Structure (23 sections, 3 theorems, 39 equations, 7 figures, 3 tables)


Key Result

Theorem 1

Let $h^\star: \mathbb{R}^d \to \mathbb{R}^{d^\star}$ be a ground truth representation map, and let $f^\star:\mathbb{R}^{d^\star}\to \mathbb{R}$ be a finetuning task of interest. Let $h_s:\mathbb{R}^d \to \mathbb{R}^{d_s}$ and $h_w:\mathbb{R}^d \to \mathbb{R}^{d_w}$ be the strong and weak model representations [...] be the function learnt by the strong model under weak supervision. Lastly, let us assume that there [...]

Figures (7)

  • Figure 1: $f_{sw} \circ h_s$ is the projection of $f_w \circ h_w$ onto the convex set $V_s$.
  • Figure 2: (a), (b), (c) Experiments on synthetic data. (d), (e), (f) QSAR tasks over MolBERT representations on the ESOL, FreeSolv and Lipop datasets. For each dataset, ChemBench [charleshen_2020_4054866] provides three different train, test and validation splits; multiple points of the same color correspond to weak-to-strong supervision for the same weak model (as specified in the legend) across these splits.
  • Figure 3: Strong-to-weak generalization. The roles of the weak and strong models have reversed.
  • Figure 4: Non-realizable weak-to-strong generalization where $f^\star \circ h^\star \notin V_s$, and we use a finite sample to perform weak-to-strong supervision. The Pythagorean theorem, along with uniform convergence and triangle inequalities, yields the desired result.
  • Figure 5: Results on the Essay Scoring dataset. Each plot corresponds to the task of predicting the score based on a different rubric.
  • ...and 2 more figures

Theorems & Definitions (10)

  • Theorem 1: Weak-to-Strong Generalization under Realizability
  • Theorem 2: Weak-to-Strong Generalization under Non-Realizability and Finite Samples
  • Claim 3
  • proof
  • proof: Proof of Theorem 1
  • proof: Proof of Theorem 2
  • Lemma 4: Uniform Convergence
  • proof
  • Claim 5: "Triangle Inequality"
  • proof