Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models

Sean Lamont, Christian Walder, Paul Montague, Amir Dezfouli, Michael Norrish

TL;DR

This work modifies intermediate samples in a batch sequentially: each sample is repelled from the feature space of previous samples, actively penalising redundancy. The method offers an immediate, low-cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search.

Abstract

Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@$k$ problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training-free, low-cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model. Our results demonstrate significantly improved diversity and Pass@$k$ performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low-cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at https://github.com/sean-lamont/odd.

Paper Structure (20 sections, 5 equations, 6 figures, 3 tables, 2 algorithms)

Figures (6)

  • Figure 1: Example outputs from LLaDA with standard sampling (left) and our approach (right), both with a temperature of 1. For this example, the base sampler achieved a Pass@16 of 0, while our model found 3 valid solutions. Despite having the same temperature, standard sampling leads to mode collapse over an incorrect approach, whereas our method optimises to select samples distinct from those previously explored.
  • Figure 2: An overview of our proposed diverse sampling framework, using our orthogonal diversity loss. Given a full batch of $n$ samples, feature vectors $F(\mathbf{x}_i)$ are extracted from the logits $\mathbf{x}_i$ of each sample $i \in \{1, \dots, n\}$. We then loop through the samples, compute an orthogonal basis $B_{<i}$ for the features of all previous samples, and calculate the projection of the current sample's features onto this basis. The norm of the residual with respect to this projection is the diversity objective. Optimising this objective moves the sample in a direction distinct from previous samples.
  • Figure 3: Average batch diversity (1 - cosine similarity of sentence embeddings) for GSM8K (top) and HumanEval (bottom), comparing ODD to the baseline ($\alpha=0$, as indicated by dashed lines). ODD increases diversity at low temperatures and acts as a coherence filter at high temperatures (evidenced by the improved Pass@16 in Table \ref{tab:results_final}), effectively balancing exploration and quality.
  • Figure 4: Pass@1 vs Pass@16 Pareto frontiers. $\alpha=0$ represents standard LLaDA sampling (baseline). Data points represent increasing temperatures from right ($\theta=0$) to left ($\theta=2$). Top (GSM8K): ODD trades off individual accuracy (shifting left) for superior batch coverage (shifting up). Bottom (HumanEval): For $\alpha \le 16$, ODD achieves a Pareto improvement, boosting coverage without loss of quality.
  • Figure 5: Empirical Pass@$k$ for ODD over GSM8K (top) and HumanEval (bottom). ODD consistently improves on the baseline. Given the batch size invariance of ODD, we expect this trend to continue for larger batch sizes.
  • ...and 1 more figure
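
The projection-and-residual objective described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation (that is in the linked repository). It only computes the value of the objective for toy feature vectors, and it assumes plain Python lists as features and a Gram-Schmidt basis. The function names `orthonormal_basis` and `residual_norm` are hypothetical, and the feature extractor $F$ and the gradient step on the logits are left out.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def remove_components(vector, basis):
    """Subtract from `vector` its projection onto each orthonormal basis vector."""
    r = list(vector)
    for b in basis:
        c = dot(r, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    return r


def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors` (hypothetical helper)."""
    basis = []
    for v in vectors:
        r = remove_components(v, basis)
        norm = math.sqrt(dot(r, r))
        if norm > eps:  # skip vectors already inside the current span
            basis.append([ri / norm for ri in r])
    return basis


def residual_norm(feature, previous_features):
    """Norm of the part of `feature` orthogonal to the span of earlier features."""
    basis = orthonormal_basis(previous_features)
    r = remove_components(feature, basis)
    return math.sqrt(dot(r, r))


# Toy features standing in for F(x_i); real features would come from model logits.
features = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],  # lies in the span of the first two, so its residual is ~0
    [0.0, 0.0, 3.0],  # orthogonal to everything before it
]
scores = [residual_norm(f, features[:i]) for i, f in enumerate(features)]
```

In this toy batch the third vector is redundant with respect to the first two, so its residual norm is close to zero, while the fourth is fully novel and keeps its whole norm. Under the framework described above, a sample with a small residual is the one that would be pushed hardest away from its predecessors.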