Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods

Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, Akash Srivastava

TL;DR

This work reframes inference-time scaling (ITS) of LLMs as posterior inference over a state-space model and applies particle filtering (PF) to maintain a diverse set of candidate trajectories guided by a reward model. By sampling from the typical set rather than greedily chasing the top score, the method tolerates reward-model noise and multi-modality, achieving a 4–16x better scaling rate than deterministic search baselines on math and general reasoning tasks. Empirical results show that small open models scaled with PF can match or exceed stronger proprietary models on MATH500 and AIME 2024, with strong performance on non-math datasets as well. The study also introduces extensions such as Particle Gibbs and parallel tempering, and provides ablations on process reward models (PRMs), aggregation, temperature, and compute budgets, connecting probabilistic inference techniques with ITS to support more robust future algorithms.

Abstract

Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking as a consequence of approximation errors in reward models. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimizing for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4-16x better scaling rate than their deterministic search counterparts on various challenging mathematical reasoning tasks. Using our approach, we show that Qwen2.5-Math-1.5B-Instruct can surpass GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1-level accuracy in only 32 rollouts. Our work not only presents an effective method for inference-time scaling, but also connects the rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work. Code, videos, and further information are available at https://probabilistic-inference-scaling.github.io.
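
The posterior that the abstract refers to can be written out as a rough sketch, using the notation of the Figure 3 caption ($c$ for the prompt, $x_t$ for LLM outputs, $o_t$ for reward-model acceptances). Two points here are assumptions rather than statements from the text above: that each step $x_t$ is conditioned on the prompt and all previous steps, and that the acceptance likelihood $p(o_t = 1 \mid \cdot)$ is approximated by the reward model's score.

```latex
% Sketch only: the factorization and the PRM-based likelihood are assumptions.
\[
  p(x_{1:T} \mid c,\, o_{1:T} = 1)
  \;\propto\;
  \prod_{t=1}^{T}
    \underbrace{p(x_t \mid c,\, x_{1:t-1})}_{\text{LLM step proposal}}
    \;
    \underbrace{p(o_t = 1 \mid c,\, x_{1:t})}_{\text{approximate likelihood (reward model)}}
\]
```

Under this reading, a search method looks for the mode of the right-hand side, while a sampling method draws trajectories in proportion to it, which is why errors in the approximate likelihood are less likely to dominate the outcome.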

Paper Structure

This paper contains 39 sections, 2 theorems, 5 equations, 8 figures, 3 tables, 3 algorithms.

Key Result

Theorem 1

Let $\{(w^{(i)}, x^{(i)})\}$ be the weighted particles produced by the particle filtering algorithm, and let $\mathrm{is\_correct}(x)$ be a function that checks the correctness of response $x$. Then the accuracy estimate computed from the weighted particles is unbiased, where the expectation is taken over the randomness of the algorithm itself.

Figures (8)

  • Figure 1: A real example of a PRM assigning a lower score to the first step of a solution that turns out to be correct. In deterministic scaling methods, this solution would have been discarded in favor of one that had a higher initial PRM score but turned out to be incorrect.
  • Figure 2: Inference-time scaling with particle filtering: initialize $n$ particles, generate a step for each, score with the PRM, resample via softmax-weighted scores, and repeat until full solutions are formed.
  • Figure 3: State-space model for inference-time scaling. $c$ is a prompt, $x_1, \dots, x_T$ are LLM outputs, and $o_1, \dots, o_T$ are "observed" acceptances from a reward model. We estimate the latent states conditioned on $o_t = 1$ for all $t$.
  • Figure 4: Accuracy vs. Generation Budget across models using different inference-time strategies.
  • Figure 5: Comparison of PF and Particle Gibbs with different numbers of iterations, evaluated on a 100-question subset of the MATH-500 dataset using Llama-3.2-1B-Instruct as the policy model.
  • ...and 3 more figures
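
The loop described in the Figure 2 caption (initialize $n$ particles, generate a step for each, score with the PRM, resample via softmax-weighted scores, repeat) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose_step`, `score`, and `is_done` are hypothetical callables standing in for the LLM, the PRM, and an end-of-solution check, and the toy versions below operate on digits rather than text.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def particle_filter(propose_step, score, is_done,
                    n_particles=8, max_steps=10, rng=None):
    """Sketch of the Figure 2 loop; all three callables are hypothetical."""
    if rng is None:
        rng = random.Random(0)
    # Each particle is a partial solution, stored as a list of steps.
    particles = [[] for _ in range(n_particles)]
    for _ in range(max_steps):
        # 1. Generate one more step for every unfinished particle.
        particles = [p if is_done(p) else p + [propose_step(p, rng)]
                     for p in particles]
        # 2. Score each partial solution (stand-in for the PRM).
        weights = softmax([score(p) for p in particles])
        # 3. Resample with replacement in proportion to the softmax weights,
        #    so low-scoring particles can survive instead of being pruned.
        particles = [list(p) for p in
                     rng.choices(particles, weights=weights, k=n_particles)]
        if all(is_done(p) for p in particles):
            break
    return particles


# Toy stand-ins (assumptions for illustration only).
def toy_propose(partial, rng):
    return rng.randint(0, 9)          # "LLM" emits a random digit


def toy_score(partial):
    return sum(partial) / (9 * max(len(partial), 1))   # "PRM" score in [0, 1]


def toy_done(partial):
    return len(partial) >= 3          # a "solution" is three steps long


if __name__ == "__main__":
    for particle in particle_filter(toy_propose, toy_score, toy_done):
        print(particle)
```

The key design point is step 3: resampling is stochastic, so a particle with a weak early score (as in the Figure 1 example) keeps a nonzero chance of being carried forward, whereas a deterministic top-k selection would drop it.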

Theorems & Definitions (3)

  • Theorem 1: Unbiasedness of Expected Accuracy
  • Theorem 2: Unbiasedness of Expected Accuracy
  • proof