A transfer learning framework for weak-to-strong generalization

Seamus Somerstep, Felipe Maia Polo, Moulinath Banerjee, Ya'acov Ritov, Mikhail Yurochkin, Yuekai Sun

TL;DR

The paper addresses weak-to-strong generalization in LLM alignment by recasting it as transfer learning of a latent concept prior from a weaker, aligned model to a stronger, unaligned model. It proves that naive fine-tuning on weak labels is fundamentally limited and introduces a refinement-based approach, leveraging in-context learning to elicit latent knowledge and produce refined supervision that enables the strong model to realize the target concept. Theoretical results show finite-sample guarantees and exponential decay of misalignment with respect to the refinement batch size, while experiments across persona transfer, mathematical reasoning, and explanation tasks demonstrate practical improvements over naive fine-tuning and baselines. The work highlights a principled route to superalignment under a convex-hull assumption, with implications for scalable, safer alignment of increasingly capable LLMs.

Abstract

Modern large language model (LLM) alignment techniques rely on human feedback, but it is unclear whether these techniques fundamentally limit the capabilities of aligned LLMs. In particular, it is unknown if it is possible to align (stronger) LLMs with superhuman capabilities with (weaker) human feedback without degrading their capabilities. This is an instance of the weak-to-strong generalization problem: using feedback from a weaker (less capable) model to train a stronger (more capable) model. We prove that weak-to-strong generalization is possible by eliciting latent knowledge from pre-trained LLMs. In particular, we cast the weak-to-strong generalization problem as a transfer learning problem in which we wish to transfer a latent concept prior from a weak model to a strong pre-trained model. We prove that a naive fine-tuning approach suffers from fundamental limitations, but an alternative refinement-based approach suggested by the problem structure provably overcomes the limitations of fine-tuning. Finally, we demonstrate the practical applicability of the refinement approach in multiple LLM alignment tasks.
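The refinement idea described above (use the weak model's labels as in-context demonstrations, let the strong pre-trained model re-generate the labels, then fine-tune on the re-generated labels) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `strong_generate` is a hypothetical callable standing in for any strong-model completion API, the prompt template is invented for the example, and choosing the first `batch_size` other examples as demonstrations is an assumption made for simplicity.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (prompt, label)


def build_icl_prompt(demos: List[Example], query: str) -> str:
    """Format weak-labeled demonstrations, followed by the query to re-label."""
    parts = [f"Prompt: {p}\nResponse: {r}" for p, r in demos]
    parts.append(f"Prompt: {query}\nResponse:")
    return "\n\n".join(parts)


def refine_labels(
    weak_data: List[Example],
    strong_generate: Callable[[str], str],  # hypothetical strong-model call
    batch_size: int = 3,
) -> List[Example]:
    """Re-label every prompt with the strong model, conditioned on weak demos."""
    refined = []
    for i, (query, _weak_label) in enumerate(weak_data):
        # Assumption: demonstrations are other weak-labeled examples, so the
        # query's own weak label is never shown to the strong model.
        others = weak_data[:i] + weak_data[i + 1:]
        demos = others[:batch_size]
        refined.append((query, strong_generate(build_icl_prompt(demos, query))))
    return refined
```

The refined pairs would then replace the weak labels as the fine-tuning set for the strong model. Here `batch_size` plays the role of the refinement batch size mentioned in the TL;DR.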

Paper Structure

This paper contains 25 sections, 8 theorems, 77 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Proposition 3.2

Consider the case of biased weak supervision; if $\epsilon_P^2$ and $\epsilon_{Q'}^2$ denote the squared bias of the source and weak models, then the following lower bound on the MSE of estimators produced by naive fine-tuning holds:

Figures (4)

  • Figure 1: Comparing performance of naive fine-tuning and our ICL method on tinyAlpacaEval. Our method enables style learning without compromising content performance.
  • Figure 2: Comparing performance of naive fine-tuning and our ICL method on tinyTruthfulQA. Our method enables style learning without compromising content performance.
  • Figure 3: From left to right: model accuracy on GSM8K with GPT-3.5-Turbo, model accuracy on MATH with GPT-3.5-Turbo, model accuracy on GSM8K with GPT-4o-mini, model accuracy on MATH with GPT-4o-mini.
  • Figure 4: Comparing performance of naive fine-tuning and our ICL method on science questions created by GPT-4. Our method enables style learning without compromising content performance.

Theorems & Definitions (18)

  • Example 2.2: Persona Learning
  • Definition 3.1
  • Proposition 3.2
  • Example 3.3: Persona Learning Test
  • Proposition 3.4
  • Example 4.1: Persona Learning Label Re-sample
  • Theorem 4.3
  • Proposition A.3: wang2024largelanguagemodelslatent
  • Theorem A.4: pathak2024transformers
  • Lemma B.2: xie2021Explanation, Theorem 1 (part 1)
  • ...and 8 more