Weight Ensembling Improves Reasoning in Language Models

Xingyu Dang; Christina Baek; Kaiyue Wen; Zico Kolter; Aditi Raghunathan

TL;DR

The paper identifies diversity collapse during supervised fine-tuning (SFT) as a key bottleneck limiting Pass@K on reasoning tasks, even as Pass@1 continues to improve. It applies WiSE-FT, a simple weight-space ensembling technique that interpolates between an early checkpoint and the current fine-tuned model, to recover diversity without sacrificing accuracy, improving both Pass@1 and Pass@K. Empirical results across GSM8K, MATH, AIME, and OpenThoughts-114k show that starting from WiSE-FT gives better test-time scaling (Best@K, majority vote) and more data-efficient RL than standard SFT or decoding-based mitigation alone. The authors formalize a bias-variance tradeoff for Pass@K, show that diversity collapse leads to bimodal error distributions, and find that WiSE-FT reduces both bias and variance, whereas decoding strategies tend to trade one for the other. Overall, WiSE-FT is a scalable, complementary way to maintain diverse, high-quality reasoning traces, enabling more effective inference-time scaling and RL fine-tuning in large language models.
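The weight interpolation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: checkpoints are modeled as plain dictionaries mapping parameter names to lists of floats, and the function name `interpolate_weights` and the coefficient `alpha` are assumptions made for the example. The summary does not state which interpolation coefficient the paper uses.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(early: Weights, late: Weights, alpha: float) -> Weights:
    """Linearly interpolate two checkpoints: (1 - alpha) * early + alpha * late.

    alpha = 0 returns the early checkpoint, alpha = 1 returns the late one.
    The value of alpha is an assumption of this sketch, not taken from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if early.keys() != late.keys():
        raise ValueError("checkpoints must have the same parameter names")

    merged: Weights = {}
    for name in early:
        if len(early[name]) != len(late[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [
            (1.0 - alpha) * e + alpha * l
            for e, l in zip(early[name], late[name])
        ]
    return merged


# Toy usage with made-up numbers.
early_ckpt = {"w": [0.0, 2.0]}
late_ckpt = {"w": [1.0, 4.0]}
merged_ckpt = interpolate_weights(early_ckpt, late_ckpt, alpha=0.5)
```

With real models the same element-wise combination would be applied to each tensor in the two checkpoints' parameter dictionaries; the merged weights are then loaded into a single model, so inference costs the same as for either checkpoint alone.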

Abstract

We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the Pass@1 rate reliably improves during supervised finetuning (SFT), but Pass@k rapidly deteriorates. Surprisingly, a simple intervention of interpolating the weights of the latest SFT checkpoint with an early checkpoint, otherwise known as WiSE-FT, almost completely recovers Pass@k while also improving Pass@1. The WiSE-FT variant achieves better test-time scaling (Best@k, majority vote) and achieves superior results with less data when tuned further by reinforcement learning. Finally, we find that WiSE-FT provides complementary performance gains that cannot be achieved only through diversity-inducing decoding strategies, like temperature scaling. We formalize a bias-variance tradeoff of Pass@k with respect to the expectation and variance of Pass@1 over the test distribution. We find that WiSE-FT can reduce bias and variance simultaneously, while temperature scaling inherently trades off between bias and variance.
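To see why the variance of Pass@1 across problems matters for Pass@k, the toy calculation below may help. It is a sketch under a stated assumption, not the paper's formal decomposition: if the k samples for a problem are drawn independently and each is correct with probability p, that problem's Pass@k is 1 - (1 - p)^k. The per-problem rates used here are invented for illustration, and the function name `expected_pass_at_k` is hypothetical.

```python
from typing import Sequence


def expected_pass_at_k(pass1_rates: Sequence[float], k: int) -> float:
    """Average Pass@k over problems, assuming k independent samples per problem.

    For a problem with per-sample success probability p, the chance that at
    least one of k independent samples is correct is 1 - (1 - p) ** k.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not pass1_rates:
        raise ValueError("need at least one problem")
    return sum(1.0 - (1.0 - p) ** k for p in pass1_rates) / len(pass1_rates)


# Two invented test sets with the same mean Pass@1 of 0.5.
# "collapsed": every problem is either always solved or never solved (bimodal).
# "diverse": every problem is solved half of the time.
collapsed = [0.0, 0.0, 1.0, 1.0]
diverse = [0.5, 0.5, 0.5, 0.5]

collapsed_pass4 = expected_pass_at_k(collapsed, k=4)
diverse_pass4 = expected_pass_at_k(diverse, k=4)
print(f"collapsed Pass@4 = {collapsed_pass4}, diverse Pass@4 = {diverse_pass4}")
```

Both sets have the same average Pass@1, yet the bimodal set gains nothing from extra samples while the diverse one does. This is the intuition behind treating the variance of Pass@1 over the test distribution as a quantity to reduce alongside the bias.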
