GuidedSampling: Steering LLMs Towards Diverse Candidate Solutions at Inference-Time

Divij Handa, Mihir Parmar, Aswin RRV, Md Nayem Uddin, Hamid Palangi, Chitta Baral

TL;DR

This work tackles the limited diversity of solutions produced by inference-time repeated sampling (RS) in large language models. It introduces GuidedSampling, a two-phase framework that explicitly explores diverse concepts (exploration) before generating final solutions conditioned on those concepts (generation), coupled with theoretical bounds that explain when it outperforms RS. Empirically, GuidedSampling delivers substantial gains in pass@k across math, science, coding, and Olympiad benchmarks, and increases the diversity of attempted concepts per instance. The authors further demonstrate the utility of GuidedSampling for post-training data synthesis, showing improvements when finetuning on trajectories generated by GuidedSampling and evidencing some cross-domain generalization. Overall, this approach offers a principled and efficient way to leverage inference-time compute for richer, more diverse problem-solving strategies in LLMs.

Abstract

Repeated Sampling (RS) is a simple inference-time algorithm that has been shown to improve model performance on complex tasks. Although it is an effective way of scaling inference-time compute, it often struggles to generate diverse solution candidates, frequently relying on the same underlying approach to solve the problem and thus producing redundant samples. To address this limitation, we propose a new inference algorithm, GuidedSampling, which decouples the exploration and generation phases during inference, increasing the diversity of generated candidate solutions. The exploration phase identifies multiple concepts that can be used to solve the problem, while the generation phase applies a specific concept to produce final solution candidates. We first define the theoretical bounds of GuidedSampling and then empirically demonstrate that it improves the performance of the base model at pass@50 by ~21.6% on average across various benchmarks compared to RS. Furthermore, models trained on GuidedSampling trajectories show substantial improvements at pass@5, ~9.7% on average, compared to models trained on traditional RS. Additionally, models trained with GuidedSampling increase the average number of concepts per instance from 1.67 to 3.03, yielding a more diverse set of candidates than traditional RS.
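
The two phases described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose_concept` and `solve` are hypothetical stand-ins for LLM calls, and conditioning each new concept on the ones already found, deduplicating by exact match, and drawing the same number of samples per concept are all illustrative choices.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures for the two LLM calls (assumptions, not a real API).
ProposeFn = Callable[[str, List[str]], Optional[str]]  # (question, seen concepts) -> concept
SolveFn = Callable[[str, Optional[str]], str]          # (question, concept or None) -> solution


def repeated_sampling(question: str, solve: SolveFn, n: int) -> List[str]:
    """Baseline RS: draw n solutions directly, with no concept guidance."""
    return [solve(question, None) for _ in range(n)]


def guided_sampling(
    question: str, propose_concept: ProposeFn, solve: SolveFn, k: int, m: int
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Exploration (up to k distinct concepts), then generation (m samples per concept)."""
    # Exploration phase: collect distinct concepts, stopping early if none is returned.
    concepts: List[str] = []
    for _ in range(k):
        concept = propose_concept(question, concepts)
        if concept is None:
            break
        if concept not in concepts:
            concepts.append(concept)

    # Generation phase: each candidate solution is conditioned on one concept.
    candidates: List[Tuple[str, str]] = []
    for concept in concepts:
        for _ in range(m):
            candidates.append((concept, solve(question, concept)))
    return concepts, candidates


# Toy stand-ins so the sketch runs without a model.
def toy_propose(question: str, seen: List[str]) -> Optional[str]:
    pool = ["induction", "pigeonhole principle", "generating functions"]
    remaining = [c for c in pool if c not in seen]
    return remaining[0] if remaining else None


def toy_solve(question: str, concept: Optional[str]) -> str:
    return f"solution via {concept or 'default approach'}"
```

With the toy stand-ins, `guided_sampling("q", toy_propose, toy_solve, k=5, m=2)` finds three concepts and returns two candidates for each, whereas `repeated_sampling` never sees a concept at all.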

Paper Structure

This paper contains 46 sections, 1 theorem, 17 equations, 5 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Let $P_{RS}(y^*\;|\;x)$ be the probability of generating a correct solution through Repeated Sampling and $P_{GS}(y^*\;|\;x)$ be the probability of generating a correct solution through GuidedSampling. Under the sharpening assumption, $P_{GS}(y^*\;|\;x) > P_{RS}(y^*\;|\;x)$ iff the following condition … where $P(\mathcal{C}_{r}\;|\;x) = \sum_{c \in \mathcal{C}_{r}} \pi_{concept}(c\;|\;x)$ is the probability …
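
As a purely illustrative reading of the sum that defines $P(\mathcal{C}_{r}\;|\;x)$ (the three-concept setup and all numbers are assumptions, not values from the paper): suppose $\pi_{concept}(\cdot\;|\;x)$ places mass $0.5$, $0.3$, and $0.2$ on concepts $c_1$, $c_2$, and $c_3$, and $\mathcal{C}_{r} = \{c_1, c_2\}$. Then

$$P(\mathcal{C}_{r}\;|\;x) = \pi_{concept}(c_1\;|\;x) + \pi_{concept}(c_2\;|\;x) = 0.5 + 0.3 = 0.8.$$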

Figures (5)

  • Figure 1: Pass@50 improvements with best performing base model using GuidedSampling.
  • Figure 2: GuidedSampling enhances exploration during inference by first generating a set of diverse concepts or theorems to guide subsequent generations of solutions. Unlike repeated sampling (RS), where the model generates the final solution directly, GuidedSampling separates exploration from generation.
  • Figure 3: Distribution of the number of concepts used by Llama-3.2-3B-Instruct for 100 candidates. 37% of the questions are attempted with just one concept, while less than 36% of the questions have more than two concepts.
  • Figure 4: GuidedSampling forces exploration at inference time, resulting in a $16.01\%$ average pass@k improvement compared to repeated sampling. We observe an average improvement of $21.8\%$ on MATH, $11.87\%$ on GPQA-Diamond, $11.28\%$ on HumanEval, and $3.08\%$ on OlympiadBench. First row: Llama-3.2-3B-Instruct; second row: Qwen2.5-3B-Instruct.
  • Figure 5: Pass@50 performance variation with different exploration (number of concepts $K$) and generation (samples per idea $M$) compute allocations, given a fixed total compute of 100 calls ($M=100/K$). Increasing exploration initially helps, but performance declines when the generation budget per idea becomes too small. At $K=0$, GuidedSampling becomes traditional RS. The first row shows results for Llama-3.2-3B-Instruct, and the second for Qwen2.5-3B-Instruct.
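
The pass@k metric and the fixed-budget trade-off in Figure 5 can be made concrete with a small sketch. The estimator below is the standard unbiased pass@k estimator widely used for code benchmarks; the text here does not state which estimator the authors use, so treat that choice as an assumption. `generation_budget` is a hypothetical helper: using floor division when $K$ does not divide the budget evenly, and giving the whole budget to plain RS at $K=0$, are illustrative choices.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)


def generation_budget(total_calls: int, k_concepts: int) -> int:
    """Samples per concept (M) for a fixed total budget, following M = total / K."""
    if k_concepts < 0:
        raise ValueError("k_concepts must be non-negative")
    if k_concepts == 0:
        # No exploration: all calls go to direct sampling (traditional RS).
        return total_calls
    return total_calls // k_concepts


if __name__ == "__main__":
    for k_concepts in (0, 2, 5, 10, 20, 50):
        m = generation_budget(100, k_concepts)
        print(f"K={k_concepts:>2}  M={m}")
```

The loop makes the tension in the caption visible: as $K$ grows, each concept receives fewer generation samples, so a concept that only occasionally yields a correct solution may never get enough draws to succeed.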

Theorems & Definitions (3)

  • Definition 1: Notation
  • Theorem 1
  • proof