Weight Ensembling Improves Reasoning in Language Models

Xingyu Dang, Christina Baek, Kaiyue Wen, Zico Kolter, Aditi Raghunathan

TL;DR

The paper identifies diversity collapse during supervised fine-tuning as a key bottleneck limiting Pass@K in reasoning tasks, even as Pass@1 continues to improve. It introduces WiSE-FT, a simple weight-space ensembling technique that interpolates between an early checkpoint and the current finetuned model to recover diversity without sacrificing accuracy, improving both Pass@1 and Pass@K. Empirical results for models finetuned on GSM8K, MATH, and OpenThoughts-114k and evaluated on GSM8K, MATH500, and AIME24 demonstrate better test-time scaling (Best@K, Majority Vote) and more data-efficient RL when starting from WiSE-FT, compared to standard SFT or decoding-based mitigation alone. The authors formalize a bias-variance tradeoff for Pass@K, show that diversity collapse leads to bimodal error distributions, and show that WiSE-FT reduces both bias and variance, whereas decoding strategies tend to trade one for the other. Overall, WiSE-FT provides a scalable, complementary approach to maintaining diverse, high-quality reasoning traces, enabling more effective inference-time scaling and RL fine-tuning in large language models.

Abstract

We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the Pass@1 rate reliably improves during supervised finetuning (SFT), but Pass@k rapidly deteriorates. Surprisingly, a simple intervention of interpolating the weights of the latest SFT checkpoint with an early checkpoint, otherwise known as WiSE-FT, almost completely recovers Pass@k while also improving Pass@1. The WiSE-FT variant achieves better test-time scaling (Best@k, majority vote) and achieves superior results with less data when tuned further by reinforcement learning. Finally, we find that WiSE-FT provides complementary performance gains that cannot be achieved only through diversity-inducing decoding strategies, like temperature scaling. We formalize a bias-variance tradeoff of Pass@k with respect to the expectation and variance of Pass@1 over the test distribution. We find that WiSE-FT can reduce bias and variance simultaneously, while temperature scaling inherently trades off between bias and variance.
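
The weight interpolation described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: checkpoints are represented as plain dictionaries of flat float lists, and the function name `interpolate_weights` and the coefficient `alpha` (the weight placed on the later checkpoint) are hypothetical names introduced here. With `alpha = 0.5` the result corresponds to the equal-weight average of an early and a late checkpoint.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def interpolate_weights(w_early: Weights, w_late: Weights, alpha: float = 0.5) -> Weights:
    """Return alpha * w_late + (1 - alpha) * w_early, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if w_early.keys() != w_late.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: Weights = {}
    for name in w_early:
        early, late = w_early[name], w_late[name]
        if len(early) != len(late):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Linear interpolation in weight space, applied elementwise.
        merged[name] = [alpha * l + (1.0 - alpha) * e for e, l in zip(early, late)]
    return merged


# Toy checkpoints (made-up values, for illustration only).
w_0 = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
w_t = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
w_wise = interpolate_weights(w_0, w_t, alpha=0.5)
```

In a real setting the same elementwise average would be applied to the tensors of two saved checkpoints of the same architecture; only one merged model is then served, so inference cost is unchanged relative to a single checkpoint.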

Paper Structure

This paper contains 40 sections, 10 theorems, 47 equations, 28 figures, 3 tables.

Key Result

Proposition 6.1

$\mathbb{E}_{x,y \sim \mathcal{D}}\left[\mathrm{Pass@K}(x)\right] \leq 1 - ((\underbrace{\mathbb{E}_{x,y \sim \mathcal{D}}[1 - \rho_x]}_{\mathrm{Bias}})^2 + \underbrace{\mathrm{Var}(\rho_x)}_{\mathrm{Variance}})^{K/2}$
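
A small numeric check can make the bound concrete. The sketch below assumes that $\rho_x$ denotes the per-example Pass@1, that the $K$ samples are drawn independently so that $\mathrm{Pass@K}(x) = 1 - (1 - \rho_x)^K$, and that $K \geq 2$. The function names and the two toy sets of $\rho_x$ values are made up for illustration. Both sets have the same mean Pass@1 of 0.5, but one is bimodal (every example is either always solved or never solved) and the other is uniform.

```python
from statistics import fmean, pvariance


def pass_at_k(rho: float, k: int) -> float:
    """Pass@K for one example whose per-sample success probability is rho."""
    return 1.0 - (1.0 - rho) ** k


def expected_pass_at_k(rhos: list, k: int) -> float:
    """Average Pass@K over a finite set of examples."""
    return fmean(pass_at_k(r, k) for r in rhos)


def bias_variance_bound(rhos: list, k: int) -> float:
    """Upper bound 1 - (Bias^2 + Variance)^(K/2) from the proposition."""
    bias = fmean(1.0 - r for r in rhos)  # E[1 - rho_x]
    variance = pvariance(rhos)           # Var(rho_x) over the examples
    return 1.0 - (bias ** 2 + variance) ** (k / 2)


K = 4
collapsed = [0.0, 0.0, 1.0, 1.0]  # bimodal Pass@1: high variance
diverse = [0.5, 0.5, 0.5, 0.5]    # same mean Pass@1, zero variance

collapsed_pass, collapsed_bound = expected_pass_at_k(collapsed, K), bias_variance_bound(collapsed, K)
diverse_pass, diverse_bound = expected_pass_at_k(diverse, K), bias_variance_bound(diverse, K)
```

With these toy values, the bimodal set stays at an expected Pass@4 of 0.5 (bound 0.75), while the uniform set reaches 0.9375 and meets its bound exactly. At equal bias, the higher variance of $\rho_x$ is what caps Pass@K.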

Figures (28)

  • Figure 1: Pass@k of WiSE-FT versus SFT on GSM8k. Gemma-2-2B is supervised finetuned and evaluated on GSM8k. At each SFT timestep $t$, we evaluate Pass@k of checkpoint $\boldsymbol w_t$ (dashed) against its WiSE-FT variant $1/2\cdot\boldsymbol w_t + 1/2\cdot\boldsymbol w_0$ (solid), where traces are independently sampled with temperature $T \in \{0.7, 1.0, 1.3, 1.6\}$.
  • Figure 2: Pass@1 vs. Pass@K across Interpolation Coefficients. We perform WiSE-FT with $\delta \in [0.1, 0.9]$ between the first and last checkpoints of each model (in legend) finetuned on GSM8K, MATH, and OpenThoughts-114K, then evaluate on GSM8K, MATH500, and AIME24, respectively. Early SFT models achieve higher $\mathrm{Pass@K}$ (y-axis), while later SFT models achieve higher $\mathrm{Pass@1}$ (x-axis). The interpolated models achieve the best of both metrics.
  • Figure 3: Downstream Advantages of WiSE-FT. (a) Best@K on MATH500 of the final SFT Gemma-2-2B checkpoint and its WiSE-FT counterpart. (b) Pass@K on AIME24: WiSE-FT after SFT on the general-purpose reasoning dataset OpenThoughts-114k achieves higher $\mathrm{Pass@K}$ on AIME24. (c) RL Scaling: Gemma and Qwen SFT checkpoints further tuned by GRPO on GSM8K and MATH, respectively. RL from the final WiSE-FT model achieves higher $\mathrm{Pass@1}$ with less data compared to GRPO starting from both early and late SFT checkpoints.
  • Figure 4: Diversity Collapse. The answer, semantic, and operation diversity of Gemma-2-2B reasoning traces across GSM8k test examples. Colors map to different SFT checkpoints.
  • Figure 5: Pass@k for SFT and RL of Qwen-2.5-0.5B on GSM8K. The purple solid line measures $\mathrm{Pass@K}$ across SFT steps, while the dashed lines correspond to further training different checkpoints by Proximal Policy Optimization (PPO). While Pass@1 continues to improve, Pass@k for larger K can decrease even with RL.
  • ...and 23 more figures
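
The figure captions above report Pass@K and majority-vote style metrics computed from independently sampled traces. How these are estimated is not stated here, so the following is only a sketch under assumptions: it uses the widely used combinatorial estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ for Pass@k given $n$ samples of which $c$ are correct, and a plain plurality vote over final answers. The function names and the toy answers are hypothetical.

```python
from collections import Counter
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n independent samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no correct one,
    # subtracted from 1. comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote(answers: list) -> str:
    """Return the most frequent final answer (ties go to the earliest seen)."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return Counter(answers).most_common(1)[0][0]


# Toy example: four sampled traces for one problem whose reference answer is "42".
sampled_answers = ["42", "17", "42", "42"]
num_correct = sum(a == "42" for a in sampled_answers)
p_at_2 = pass_at_k_estimate(n=len(sampled_answers), c=num_correct, k=2)
voted = majority_vote(sampled_answers)
```

Averaging `pass_at_k_estimate` over test problems gives a dataset-level Pass@k, while comparing `voted` against the reference answer gives majority-vote accuracy.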

Theorems & Definitions (20)

  • Proposition 6.1
  • Proposition B.1
  • proof
  • Theorem C.1: Collapse to Deterministic Policy
  • proof
  • Definition C.2: Self-enforcing Stochastic Policy Update Rule
  • Lemma C.3: Bad Arm Probability Diminishes Using REINFORCE
  • proof
  • Lemma C.4: Bad Arm Probability Diminishes Using GRPO
  • proof
  • ...and 10 more