Sample, Don't Search: Rethinking Test-Time Alignment for Language Models

Gonçalo Faria; Noah A. Smith

TL;DR

QAlign is a test-time alignment framework that uses Markov chain Monte Carlo (MCMC) to sample from the optimal aligned distribution for each prompt, without retraining the language model (LM) or requiring logit access. It combines the suffix proposal from Quest with Metropolis-Hastings sampling, and its alignment improves as test-time compute increases. Under comparable compute budgets it outperforms best-of-n (BoN), majority voting (MV), weighted majority voting (WMV), and even direct preference optimization (DPO). The gains are consistent on mathematical reasoning benchmarks and on general alignment across diverse datasets, which makes the method practical for deploying off-the-shelf LMs with private reward signals while leaving the underlying model weights unchanged.
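To make the sampling idea concrete, here is a minimal Python sketch of Metropolis-Hastings with a suffix-resampling proposal. It is an illustration under stated assumptions, not the authors' implementation. It assumes the target has the common KL-regularized form, proportional to the base LM probability times exp(reward / beta), and that the proposal picks a cut point uniformly and regenerates the suffix from the base LM, so that the LM likelihood terms cancel in the acceptance ratio. The names `mh_suffix_chain`, `make_toy_lm`, and `toy_reward` are hypothetical, and the "LM" is a toy that emits uniform random bits.

```python
import math
import random


def mh_suffix_chain(sample_continuation, reward, beta, steps, rng):
    """Run Metropolis-Hastings over whole responses (lists of tokens).

    sample_continuation(prefix) -> full response extending `prefix`,
        drawn from the base LM (only sampling is needed, no logits).
    reward(response) -> float score from the reward model.
    Assumed target: p_LM(y | x) * exp(reward(y) / beta), up to normalization.
    """
    current = sample_continuation([])  # initial draw from the base LM
    chain = [current]
    for _ in range(steps):
        # Proposal: keep a uniformly chosen prefix, resample the suffix.
        cut = rng.randrange(len(current))
        proposal = sample_continuation(current[:cut])
        # Assumed acceptance ratio: reward difference scaled by beta, plus a
        # length correction for the uniform choice of cut point.
        log_accept = (reward(proposal) - reward(current)) / beta
        log_accept += math.log(len(current) / len(proposal))
        if rng.random() < math.exp(min(0.0, log_accept)):
            current = proposal
        chain.append(current)
    return chain


def make_toy_lm(rng, length=4):
    """Hypothetical stand-in for an LM: uniform random bits, fixed length."""
    def sample_continuation(prefix):
        missing = length - len(prefix)
        return list(prefix) + [rng.randint(0, 1) for _ in range(missing)]
    return sample_continuation


def toy_reward(response):
    """Hypothetical reward model: counts the 1 tokens in the response."""
    return float(sum(response))
```

For example, `mh_suffix_chain(make_toy_lm(rng), toy_reward, beta=0.5, steps=2000, rng=rng)` with `rng = random.Random(0)` returns a chain of 2001 responses whose average reward sits well above the base sampler's average of 2. A smaller `beta` concentrates the chain on high-reward responses, while a larger `beta` keeps it closer to the base LM.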

Abstract

Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). Our approach is a practical way to align language models at test time using additional computation without degradation, and it expands the capabilities that can be obtained from off-the-shelf language models without further training.
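For readers unfamiliar with the baselines named above, the following Python sketch shows one common way to define best-of-n, majority voting, and weighted majority voting over a set of sampled answers. These are generic textbook-style definitions written for illustration; the exact variants evaluated in the paper (for example, how weights are derived from the RM) may differ. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def best_of_n(answers, rewards):
    """Return the answer from the single highest-reward sample."""
    return max(zip(answers, rewards), key=lambda pair: pair[1])[0]


def majority_vote(answers):
    """Return the most frequent answer (ties go to the earliest seen)."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    return max(counts, key=counts.get)


def weighted_majority_vote(answers, rewards):
    """Return the answer with the largest summed reward across samples."""
    totals = defaultdict(float)
    for answer, score in zip(answers, rewards):
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical final answers extracted from five sampled responses,
# with made-up reward-model scores.
answers = ["18", "20", "18", "20", "18"]
rewards = [0.2, 0.9, 0.3, 0.8, 0.1]
```

On this made-up data, best-of-n and weighted majority voting both select "20", while plain majority voting selects "18". Best-of-n trusts the single top RM score, which is the kind of selection that is exposed to over-optimization of an imperfect reward proxy as the number of samples grows.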
