Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning

Feng Chen; Allan Raventos; Nan Cheng; Surya Ganguli; Shaul Druckmann

TL;DR

The paper demonstrates that standard cross-entropy fine-tuning can misalign with pass@N test-time search, causing performance to degrade as test-time compute increases. It introduces Direct Coverage Optimization (DCO), an objective that directly maximizes the chance of finding the correct answer within N samples, and shows that this reduces overconfidence and yields Pareto-optimal tradeoffs between exploration and exploitation. Through experiments on MATH, MiniF2F, and LeanDojo theorem proving, DCO and its variants (DCO$^{\text{step}}$ and DCO$^{\text{a}}$) consistently improve pass@N performance, especially at large N, and demonstrate the value of co-designing training-time objectives with test-time search strategies. The work argues for end-to-end consideration of training and inference-time algorithms to unlock scalable mathematical reasoning in LLMs.

Abstract

Recent progress in large language models (LLMs) highlights the power of scaling test-time compute to achieve strong performance on complex tasks, such as mathematical reasoning and code generation. This raises a critical question: how should model training be modified to optimize performance under a subsequent test-time compute strategy and budget? To explore this, we focus on pass@N, a simple test-time strategy that searches for a correct answer in $N$ independent samples. We show, surprisingly, that training with cross-entropy (CE) loss can be misaligned with pass@N in that pass@N accuracy decreases with longer training. We explain the origins of this misalignment in terms of model overconfidence induced by CE, and experimentally verify our prediction of overconfidence as an impediment to scaling test-time compute via pass@N. Furthermore, we suggest a principled, modified training loss that is better aligned to pass@N by limiting model confidence and rescuing pass@N test performance. Our algorithm demonstrates improved mathematical reasoning on MATH and MiniF2F benchmarks under several scenarios: (1) providing answers to math questions; and (2) proving theorems by searching over proof trees of varying shapes. Overall, our work underscores the importance of co-designing two traditionally separate phases of LLM development: training-time protocols and test-time search and reasoning strategies.
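To make the pass@N argument concrete, the following is a minimal sketch, not the authors' implementation. It assumes each of the $N$ samples is correct independently with per-problem probability p, so coverage is 1 - (1 - p)^N. The function names, the toy probability lists, and the negative-log-coverage loss are illustrative assumptions based only on the description above of an objective that maximizes the chance of a correct answer within N samples.

```python
import math


def pass_at_n(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct,
    given a per-sample success probability p (independence is assumed)."""
    return 1.0 - (1.0 - p) ** n


def mean_pass_at_n(probs, n: int) -> float:
    """Average pass@n over a set of problems with per-sample success probs."""
    return sum(pass_at_n(p, n) for p in probs) / len(probs)


def coverage_loss(p: float, n: int) -> float:
    """Hypothetical coverage-style loss: negative log of pass@n.
    For n = 1 this reduces to the usual cross-entropy term -log(p)."""
    return -math.log(pass_at_n(p, n))


# Toy numbers (made up for illustration, not results from the paper):
# an overconfident model is near-certain on half the problems and
# near-hopeless on the rest; a less confident model spreads its mass.
overconfident = [0.99, 0.99, 0.001, 0.001]
less_confident = [0.3, 0.3, 0.3, 0.3]

for n in (1, 16, 256):
    print(n, mean_pass_at_n(overconfident, n), mean_pass_at_n(less_confident, n))
```

In this toy setting the overconfident model wins at N = 1 but falls behind once N grows, which is the kind of misalignment between single-sample accuracy and pass@N that the abstract describes.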
