On the Effect of Sampling Diversity in Scaling LLM Inference

Tianchun Wang, Zichuan Liu, Yuanzhou Chen, Jonathan Light, Weiyang Liu, Haifeng Chen, Xiang Zhang, Wei Cheng

TL;DR

The paper tackles how sampling diversity in prompt design affects inference-time scaling of large language models. It develops a theoretical framework showing that diversified sampling improves Best-of-$N$ performance and introduces a diversity–fidelity trade-off to guide perturbation design. It analyzes when diversification is effective and when it may fail (notably with majority voting), and it substantiates these insights with extensive cross-domain experiments in reasoning, mathematics, and code generation, yielding practical guidelines for deploying sampling diversity at inference time.

Abstract

Scaling inference for large language models (LLMs) is key to unlocking greater performance, and leveraging diversity has proven an effective way to enhance it. Motivated by the observed relationship between solution accuracy and meaningful response diversity, we systematically study the effect of prompt diversity in scaling inference. We theoretically explain why diversified sampling improves Best-of-N scaling, showing that responses generated from diverse prompts after Best-of-N selection exhibit significantly lower error rates than those produced from stationary prompts. Building on this analysis, we derive a diversity-fidelity trade-off principle that guides the design of sampling strategies that introduce diversity. From this guidance, we instantiate a family of effective perturbation styles. We theoretically and empirically characterize when diversified exploration remains effective, demonstrating that it works under a variety of conditions, and we further show that under majority voting, the benefit of diversity may vanish. Finally, we systematically evaluate the effectiveness of sampling diversity and show that, when applied appropriately in different contexts, it yields relative gains of 10.8% in EM@100 for reasoning, 9.6% for mathematics, and 9.5% in Pass@100 for code generation. Overall, this work provides a systematic analysis that offers a theoretical and empirical foundation for understanding how sampling diversity affects LLM inference-time scaling.
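
The intuition behind the Best-of-N claim can be illustrated with a toy calculation. This is only a sketch under simplifying assumptions that are not taken from the paper: samples are independent, a perfect verifier picks a correct response whenever one exists, and the per-prompt success probabilities are made-up numbers. With the mean success rate held fixed, spreading success probabilities across different prompts lowers the chance that all N samples fail, which follows from the AM-GM inequality applied to the failure probabilities.

```python
import math

def best_of_n_failure(success_probs):
    """Probability that every sample fails.

    Assumes independent samples and a perfect verifier (hypothetical setup).
    """
    return math.prod(1.0 - p for p in success_probs)

n = 8

# Stationary prompt: every sample has the same (hypothetical) success rate.
stationary = [0.2] * n

# Diversified prompts: success rates vary by prompt but keep the same mean.
diverse = [0.05, 0.35] * (n // 2)

fail_stationary = best_of_n_failure(stationary)
fail_diverse = best_of_n_failure(diverse)

print(f"stationary failure: {fail_stationary:.4f}")
print(f"diverse failure:    {fail_diverse:.4f}")
```

The equal-mean constraint stands in loosely for fidelity: if perturbed prompts lowered the average success rate too far, the advantage shown here could disappear.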

Paper Structure

This paper contains 58 sections, 4 theorems, 49 equations, 19 figures, 3 tables.

Key Result

Theorem 3.5

Under Hypotheses hyp:variation and hyp:fidelity-hybrid, there exists a positive sequence $C_N=\Omega ( \hat{\mu}_1^2 N/(1+\epsilon) )$, increasing in $N$, such that

Figures (19)

  • Figure 1: A brief sketch of (a) direct sampling without diversification and (b) diversified sampling.
  • Figure 2: Effect of perturbation relevance. Relationship between perturbation-question similarity and task performance. EM rate (math) and Pass rate (code) measured from 40 solutions under five perturbation types (1–5). Results are obtained with GPT-4o-mini and are reported as the mean and standard deviation over five independent runs.
  • Figure 3: Sweep over a range of increasing temperature settings on Humaneval using GPT-4o-mini. Higher temperatures generally improve Pass@k for direct sampling, and diversified sampling provides further gains on top of these temperature-induced improvements at each setting.
  • Figure 4: Scaling curves of the Dual strategy across thinker models, with stronger models yielding higher performance.
  • Figure 5: Scaling curves for the Dual strategy as injection cardinality increases.
  • ...and 14 more figures

Theorems & Definitions (11)

  • Remark 3.2: Intuition for Hypothesis hyp:variation
  • Remark 3.4: Intuition for Hypothesis hyp:fidelity-hybrid
  • Theorem 3.5: Diversity improves Best-of-$N$
  • Proposition 5.1
  • Remark B.2
  • Remark B.4
  • Theorem B.5: Hybrid diversity improves Best-of-$K$
  • proof
  • Remark B.6: Moment effect and asymptote
  • Theorem B.7: ORM top-$k$ recall under a margin
  • ...and 1 more