How to Mitigate Overfitting in Weak-to-strong Generalization?

Junhao Shi, Qinyuan Cheng, Zhaoye Fei, Yining Zheng, Qipeng Guo, Xipeng Qiu

TL;DR

This paper tackles overfitting in weak-to-strong generalization with a two-stage framework that improves both the supervision signals and the quality of the input questions. Stage I applies an uncertainty-based self-consistency filter to the weak labels, forming Training Set A for finetuning a strong model. Stage II re-evaluates the discarded questions with the finetuned model, adds the high-confidence samples as Training Set B, and performs the final finetuning. Across GSM8K and MATH, with models including Llama 3 and DeepSeek, the approach yields substantially higher performance-gap recovery (PGR) than naive weak-to-strong baselines, reaching up to 100% on some models. The work shows that label accuracy must be balanced against question difficulty and diversity, and it suggests iterative refinement as a promising direction, subject to computational cost and task-domain limitations.
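The two stages described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the self-consistency score is the fraction of sampled answers that agree with the majority answer, that the majority answer is used as the label, and that the final finetuning uses Set A plus Set B. The names `Sampler`, `filter_by_consistency`, `two_stage`, and the `finetune` callback are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interface: a model is a function that maps a question to a
# list of sampled final answers (e.g. from several decoding runs).
Sampler = Callable[[str], List[str]]
Labeled = Tuple[str, str]  # (question, label)


def self_consistency(answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples agreeing with it."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


def filter_by_consistency(
    questions: List[str], sampler: Sampler, threshold: float
) -> Tuple[List[Labeled], List[str]]:
    """Split questions into (kept with majority label, discarded)."""
    kept: List[Labeled] = []
    discarded: List[str] = []
    for question in questions:
        answer, score = self_consistency(sampler(question))
        if score >= threshold:
            kept.append((question, answer))
        else:
            discarded.append(question)
    return kept, discarded


def two_stage(
    questions: List[str],
    weak_sampler: Sampler,
    finetune: Callable[[List[Labeled]], Sampler],  # assumed training callback
    threshold_1: float,
    threshold_2: float,
):
    # Stage I: keep questions on which the weak model is self-consistent.
    set_a, discarded = filter_by_consistency(questions, weak_sampler, threshold_1)
    stage1_model = finetune(set_a)
    # Stage II: re-label the discarded questions with the Stage I model.
    set_b, _ = filter_by_consistency(discarded, stage1_model, threshold_2)
    final_model = finetune(set_a + set_b)
    return set_a, set_b, final_model
```

In this sketch, questions the weak model answers inconsistently are not thrown away for good. They get a second chance in Stage II, which is how harder questions can return to the training data.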

Abstract

Aligning powerful AI models on tasks that surpass human evaluation capabilities is the central problem of superalignment. To address this problem, weak-to-strong generalization aims to elicit the capabilities of strong models through weak supervisors and to ensure that the behavior of strong models aligns with the intentions of weak supervisors without unsafe behaviors such as deception. Although weak-to-strong generalization exhibits certain generalization capabilities, strong models overfit significantly in this setting: because strong models have a strong fitting ability, erroneous labels from weak supervisors may lead them to overfit. In addition, simply filtering out incorrect labels may degrade question quality, leaving strong models with weak generalization ability on hard questions. To mitigate overfitting in weak-to-strong generalization, we propose a two-stage framework that simultaneously improves the quality of supervision signals and the quality of input questions. Experimental results on three series of large language models and two mathematical benchmarks demonstrate that our framework significantly improves PGR compared to naive weak-to-strong generalization, even achieving up to 100% PGR on some models.
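The abstract reports results as PGR without stating the formula. A minimal sketch follows, assuming the usual definition from the weak-to-strong generalization literature: the share of the gap between the weak supervisor and the strong model's ceiling (the strong model trained on ground-truth labels) that the weakly supervised strong model closes. The function name and argument names are illustrative.

```python
def performance_gap_recovered(
    weak_acc: float, weak_to_strong_acc: float, strong_ceiling_acc: float
) -> float:
    """Fraction of the weak-to-ceiling gap recovered by weak supervision.

    weak_acc:            accuracy of the weak supervisor
    weak_to_strong_acc:  accuracy of the strong model trained on weak labels
    strong_ceiling_acc:  accuracy of the strong model trained on ground truth
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        # PGR is undefined when the strong ceiling does not exceed the weak model.
        raise ValueError("strong ceiling must exceed weak accuracy")
    return (weak_to_strong_acc - weak_acc) / gap
```

Under this definition, a PGR of 1.0 (100%) means the weakly supervised strong model matches its ground-truth ceiling, and a value above 1.0 means it exceeds that ceiling.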

Paper Structure

This paper contains 35 sections, 4 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Illustration of different weak-to-strong generalization approaches. (a) Conventional approach with noisy labels from weak model, indicated by black dots; (b) Simple filtering approach that discards too many valuable hard samples; (c) Our framework can maintain both supervision quality and question quality.
  • Figure 2: Overview of our two-stage training framework. Stage I (top): The raw question set is filtered based on the weak model's consistency. High-consistency questions are used to generate Training Set A, which is then used for finetuning the strong model. Stage II (bottom): Previously discarded questions are re-evaluated and refined using the finetuned strong model from Stage I. High-consistency questions are selected to form Training Set B, which is then combined with Set A for final finetuning. The figure involves four models: the weak model, the primary strong model, the Stage I finetuned model, and the final finetuned model.
  • Figure 3: The relationship between supervision correctness and filtering threshold. As the filtering threshold increases, the supervision correctness (measured by label accuracy) shows a consistent upward trend.
  • Figure 4: (a) The upper row shows the performance trajectory and PGR across different stages (Baseline, Stage I, and Stage II). The solid lines represent model performance (left y-axis), while the dash-dotted lines show PGR values (right y-axis). (b) The lower row demonstrates the impact of different filtering thresholds on model performance, with triangles representing Stage I results and circles representing Stage II results. For each experimental setting, points with the same color correspond to the same Stage I filtering threshold. Results show consistent improvement patterns across all model configurations, with Stage II generally achieving better performance than Stage I.
  • Figure 5: Impact of filtering threshold on question difficulty distribution. As the threshold increases, the proportion of difficult questions (Levels 4-5) decreases while the proportion of easier questions (Levels 1-2) increases, so the average difficulty declines from 3.48 to 2.66.
  • ...and 4 more figures
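
The trade-off described in the captions of Figures 3 and 5 (a stricter threshold gives more accurate labels but easier retained questions) can be measured with a simple threshold sweep. The sketch below is illustrative only: the `Sample` fields, the `sweep` helper, and any data passed to it are hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple, Optional


class Sample(NamedTuple):
    consistency: float   # self-consistency score of the weak label, in [0, 1]
    label_correct: bool  # whether the weak label matches the ground truth
    difficulty: int      # difficulty level of the question (e.g. 1 to 5)


class SweepRow(NamedTuple):
    threshold: float
    kept: int
    label_accuracy: Optional[float]   # None when nothing is kept
    mean_difficulty: Optional[float]  # None when nothing is kept


def sweep(samples: List[Sample], thresholds: List[float]) -> List[SweepRow]:
    """For each threshold, report how many samples survive the filter,
    how accurate their labels are, and how difficult they are on average."""
    rows = []
    for threshold in thresholds:
        kept = [s for s in samples if s.consistency >= threshold]
        if kept:
            accuracy = sum(s.label_correct for s in kept) / len(kept)
            difficulty = sum(s.difficulty for s in kept) / len(kept)
        else:
            accuracy = difficulty = None
        rows.append(SweepRow(threshold, len(kept), accuracy, difficulty))
    return rows
```

Reading label accuracy and mean difficulty side by side across thresholds shows the tension that Stage II is meant to resolve.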