Sample, Don't Search: Rethinking Test-Time Alignment for Language Models

Gonçalo Faria, Noah A. Smith

TL;DR

QAlign introduces a test-time alignment framework that samples from the per-prompt optimal aligned distribution using MCMC, avoiding LM retraining or logit access. By leveraging a Quest-based suffix-proposal and Metropolis-Hastings sampling, it achieves improved alignment as compute increases, outperforming BoN, MV, WMV, and even DPO under comparable budgets. The method demonstrates consistent gains on mathematical reasoning benchmarks and general alignment across diverse datasets, highlighting its practical value for deploying off-the-shelf LMs with private reward signals. This approach broadens the toolkit for test-time optimization, offering scalable, flexible alignment without altering the underlying model weights.

Abstract

Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). A practical solution to aligning language models at test time using additional computation without degradation, our approach expands the limits of the capability that can be obtained from off-the-shelf language models without further training.
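To make the sampling idea concrete, here is a minimal sketch of Metropolis-Hastings with a suffix-resampling proposal. It assumes the target has the common form $\pi(y \mid x) \propto p_{\text{LM}}(y \mid x)\exp(r(x,y)/\beta)$ and that the proposal picks a cut index uniformly and regenerates the suffix from the LM; under those assumptions the LM likelihoods cancel in the acceptance ratio, which is why no logit access would be needed. The names `toy_lm_sample`, `toy_reward`, `acceptance_prob`, and `mh_chain`, the digit vocabulary, and the value of `beta` are all hypothetical stand-ins, not the authors' implementation.

```python
import math
import random

VOCAB = "0123456789"
MAX_LEN = 8


def toy_lm_sample(prefix, rng):
    """Hypothetical stand-in for an LM: continue `prefix` with random digits."""
    tokens = list(prefix)
    while len(tokens) < MAX_LEN:
        tokens.append(rng.choice(VOCAB))
        if rng.random() < 0.3:  # random stop, playing the role of an EOS token
            break
    return tokens


def toy_reward(tokens):
    """Hypothetical reward in [0, 1]: mean digit value divided by 9."""
    return sum(int(t) for t in tokens) / (9 * len(tokens))


def acceptance_prob(r_old, r_new, len_old, len_new, beta):
    """MH acceptance probability when the suffix is proposed by the LM itself.

    Assumption: LM likelihoods cancel, leaving the reward ratio and the
    ratio of uniform index-choice probabilities (len_old / len_new).
    """
    log_ratio = (r_new - r_old) / beta + math.log(len_old / len_new)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def mh_chain(steps, beta, seed=0):
    """Run the chain and return the state after every step."""
    rng = random.Random(seed)
    y = toy_lm_sample([], rng)  # initial state: a plain LM sample
    chain = []
    for _ in range(steps):
        i = rng.randrange(len(y))          # index where the suffix is cut
        y_new = toy_lm_sample(y[:i], rng)  # keep the prefix, resample the suffix
        a = acceptance_prob(toy_reward(y), toy_reward(y_new),
                            len(y), len(y_new), beta)
        if rng.random() < a:
            y = y_new
        chain.append(y)
    return chain


if __name__ == "__main__":
    chain = mh_chain(steps=2000, beta=0.2)
    print(sum(toy_reward(y) for y in chain) / len(chain))
```

A smaller `beta` concentrates the chain on high-reward sequences, while a larger one keeps it closer to the base LM; the sketch only calls the sampler and the reward function, never the LM's token probabilities.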

Paper Structure

This paper contains 36 sections, 42 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Average error rate across multiple evaluation datasets (GSM8K, MATH500, MMLU-Redux, TruthfulQA, and IFEval) as a function of inference-time floating point operations (FLOPs), in log scale. We compare $\bullet$ QAlign applied to Tülu3-8B-SFT against four baselines: $\blacktriangle$ majority vote (MV) with Tülu3-8B-DPO, and three methods applied to Tülu3-8B-SFT, namely $\bullet$ best-of-$n$ (BoN), $\bullet$ MV, and $\bullet$ weighted MV (WMV). All experiments use temperature 1.0 with reasoning included in model outputs. The Tülu3-8B-DPO model results from preference finetuning Tülu3-8B-SFT (approximately $1.75 \times 10^{19}$ FLOPs); the cost of this finetuning is not accounted for in this plot.
  • Figure 2: Average error rate vs. floating point operations (FLOPs) in log scale. We compare $\bullet$ QAlign with Llama-3.1-8B-Instruct against three baselines also applied to Llama-3.1-8B-Instruct: $\bullet$ best-of-$n$ (BoN), $\bullet$ majority vote (MV), and $\bullet$ weighted MV (WMV). Left: Error rate (lower is better) on the GSM8K test dataset. Right: Error rate on the GSM-Symbolic test dataset. All experiments use temperature 1.0 with reasoning included in model outputs.
  • Figure 3: Distribution of the normalized maximum reward $({r^{(n)}_{\text{max}} - a_n})/{b_n}$ for varying $n$, overlaid with the standard Gumbel distribution. The empirical distribution is estimated using 10,000 trials, each consisting of $n$ random samples drawn from a Normal distribution. The fit between the empirical distribution and the Gumbel distribution improves as $n$ increases, showing good agreement for $n \geq 32$.
  • Figure 4: Distribution of the normalized maximum reward $({r^{(n)}_{\text{max}} - \mu_{\pi,d}(x,\beta^\ast)})/\sigma_d(x)$ for varying $n$, overlaid with the standard Normal distribution. The empirical distribution is estimated using 10,000 trials, each consisting of $n$ random samples drawn from a Normal distribution. While the mode of our approximation matches that of the empirical distribution, the approximation does not capture its variance.
  • Figure 5: Histogram of rewards assigned by Tülu3-8B-RM to $1,024$ responses generated by Tülu3-8B-SFT for $9$ randomly sampled prompts from GSM8K. For each prompt, we fit a two-component Gaussian mixture model to characterize the reward distribution.
  • ...and 1 more figures
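The three baselines named in the Figure 1 and Figure 2 captions (BoN, MV, WMV) can be sketched as selection rules over a pool of sampled responses. This is an illustrative sketch, not the authors' code: it assumes each response has already been reduced to a final answer string paired with a scalar reward, and that WMV weights each vote by that reward. The function names and the example data are hypothetical.

```python
from collections import Counter, defaultdict


def best_of_n(samples):
    """Return the answer of the single highest-reward response."""
    return max(samples, key=lambda s: s[1])[0]


def majority_vote(samples):
    """Return the most frequent answer, ignoring rewards."""
    return Counter(answer for answer, _ in samples).most_common(1)[0][0]


def weighted_majority_vote(samples):
    """Return the answer with the largest total reward across its votes."""
    totals = defaultdict(float)
    for answer, reward in samples:
        totals[answer] += reward
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Hypothetical pool: (final answer, reward model score) pairs.
    pool = [("42", 0.1), ("42", 0.2), ("17", 0.9), ("42", 0.1), ("17", 0.8)]
    print(best_of_n(pool), majority_vote(pool), weighted_majority_vote(pool))
```

In this made-up pool, MV picks the most frequent answer while BoN and WMV follow the reward model. BoN in particular depends entirely on the single top-scoring response, which is where an imperfect reward model can mislead it as the pool grows.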