Ensembling Language Models with Sequential Monte Carlo

Robin Shing Moon Chan, Tianyu Liu, Samuel Kiegeland, Clemente Pasti, Jacob Hoover Vigly, Timothy J. O'Donnell, Ryan Cotterell, Tim Vieira

TL;DR

The paper proposes a unified framework for composing language models into $f$-ensemble distributions for a wide range of functions $f$, together with a byte-level sequential Monte Carlo algorithm that operates in a shared character space. This enables ensembles of models with mismatching vocabularies and gives consistent sampling in the limit.

Abstract

Practitioners have access to an abundance of language models and prompting strategies for solving many language modeling tasks; yet prior work shows that modeling performance is highly sensitive to both choices. Classical machine learning ensembling techniques offer a principled approach: aggregate predictions from multiple sources to achieve better performance than any single one. However, applying ensembling to language models during decoding is challenging: naively aggregating next-token probabilities yields samples from a locally normalized, biased approximation of the generally intractable ensemble distribution over strings. In this work, we introduce a unified framework for composing $K$ language models into $f$-ensemble distributions for a wide range of functions $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$. To sample from these distributions, we propose a byte-level sequential Monte Carlo (SMC) algorithm that operates in a shared character space, enabling ensembles of models with mismatching vocabularies and consistent sampling in the limit. We evaluate a family of $f$-ensembles across prompt and model combinations for various structured text generation tasks, highlighting the benefits of alternative aggregation strategies over traditional probability averaging, and showing that better posterior approximations can yield better ensemble performance.
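
The gap between the locally normalized approximation and the global ensemble distribution can be seen on a toy case. The sketch below is illustrative only: the two next-token tables `P1` and `P2`, the two-symbol vocabulary, and the fixed string length are hypothetical stand-ins for real language models, and the product is used as one example of an aggregation function.

```python
from itertools import product

VOCAB = ("a", "b")
LENGTH = 2  # fixed-length strings keep the toy example enumerable

# Hypothetical next-token tables: prefix -> distribution over VOCAB.
P1 = {"": {"a": 0.9, "b": 0.1}, "a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.5, "b": 0.5}}
P2 = {"": {"a": 0.5, "b": 0.5}, "a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}


def string_prob(table, s):
    """Probability of string s under an autoregressive next-token table."""
    prob = 1.0
    for t, tok in enumerate(s):
        prob *= table[s[:t]][tok]
    return prob


def local_product_table():
    """Token-level product, renormalized at every prefix."""
    table = {}
    for prefix in P1:
        scores = {tok: P1[prefix][tok] * P2[prefix][tok] for tok in VOCAB}
        z = sum(scores.values())
        table[prefix] = {tok: s / z for tok, s in scores.items()}
    return table


strings = ["".join(s) for s in product(VOCAB, repeat=LENGTH)]

# Global ensemble: multiply string probabilities, then normalize once.
unnormalized = {s: string_prob(P1, s) * string_prob(P2, s) for s in strings}
Z = sum(unnormalized.values())
p_global = {s: v / Z for s, v in unnormalized.items()}

# Local ensemble: string probabilities induced by the per-step product.
p_local = {s: string_prob(local_product_table(), s) for s in strings}

for s in strings:
    print(s, round(p_local[s], 3), round(p_global[s], 3))
```

In this toy case the local ensemble puts less mass on strings starting with `b` than the global ensemble does, because per-step renormalization ignores how much product mass remains after each prefix.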

Paper Structure (30 sections, 7 theorems, 33 equations, 10 figures, 6 tables, 1 algorithm)

This paper contains 30 sections, 7 theorems, 33 equations, 10 figures, 6 tables, 1 algorithm.

Key Result

Theorem 4.1

Let $D_\alpha$ be the $\alpha$-divergence with parameter $\alpha \in \mathbb{R} \setminus \{0, 1\}$. The unique consensus distribution $\Phi^*$ that minimizes the expert-weighted loss in eq:variational-ensembling is the generalized mean of the experts, with power parameter … where $Z$ is the normalizer ensuring $\sum_{\boldsymbol{x} \in \cdots}$ …
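
To make the notion of a generalized mean of experts concrete, here is a minimal sketch of a normalized weighted power mean over a shared finite support. The function name `power_mean_ensemble`, the exponent `r`, and the toy distributions are assumptions for illustration; how `r` relates to $\alpha$ is not reproduced here, since the statement above is cut off before giving it.

```python
import math


def power_mean_ensemble(experts, weights, r):
    """Normalized weighted power mean of distributions on a shared support.

    Assumes strictly positive probabilities and weights that sum to 1.
    `r` is a free exponent in this sketch, not tied to any alpha.
    """
    scores = {}
    for x in experts[0]:
        if r == 0:
            # Limit r -> 0 of the power mean: weighted geometric mean.
            log_mean = sum(w * math.log(p[x]) for w, p in zip(weights, experts))
            scores[x] = math.exp(log_mean)
        else:
            mean_r = sum(w * p[x] ** r for w, p in zip(weights, experts))
            scores[x] = mean_r ** (1.0 / r)
    z = sum(scores.values())  # normalizer so the result sums to 1
    return {x: s / z for x, s in scores.items()}


# Hypothetical experts over a two-element support.
p1 = {"x": 0.9, "y": 0.1}
p2 = {"x": 0.5, "y": 0.5}
weights = [0.5, 0.5]

arithmetic = power_mean_ensemble([p1, p2], weights, r=1)   # probability averaging
geometric = power_mean_ensemble([p1, p2], weights, r=0)    # normalized product-style mean
harmonic = power_mean_ensemble([p1, p2], weights, r=-1)
print(arithmetic, geometric, harmonic)
```

Lower exponents push the ensemble toward outcomes that every expert supports, while `r=1` recovers plain probability averaging.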

Figures (10)

  • Figure 1: Prompt intersection on GPT-2. We generate 200 strings from the normalized token-level product of "My favorite physicist is" and "My favorite author is" prompts, then score each under the local ensemble $p_{\mathrm{ens, local}}$ (string probabilities emerging from locally normalized token-level product ensembling) and global ensemble $p_{\mathrm{ens, global}}$ (normalized product of string probabilities). Left: scatter plot of log probabilities with color indicating the generated strings' global probability relative to their local ones. We also plot completions from the author prompt and the physicist prompt that don't overlap with the ensemble's completions. Right: Correlation with the explicitly formulated intersection constraint probability $p(\cdot \mid \text{"My favorite physicist and author is"})$.
  • Figure 2: $f$-ensembles for token-level SMC. Values show change in accuracy (%) relative to the best single-prompt baseline. $^{\dagger}$: best in column. Bold: significant improvement over the best base prompt (non-overlapping 95% CIs). Q=Qwen, P=Phi, L=Llama.
  • Figure 3: Representative relationship between expected accuracy and the estimated log marginal likelihood, $\log \widehat{Z}$, on BBH for Llama. Each datapoint represents (sample, particle) configurations aggregated across seeds. Both metrics are $z$-scored per example to normalize for variations in problem difficulty and scale. Points are colored by the number of particles used for estimation. Marginal plots display the kernel density estimates for the respective axes. The grey line and shaded band indicate a linear regression fit with a 95% confidence interval. Pearson correlation coefficients $\rho$ are annotated in the top-left ($^{***}p<0.001$, $^{*}p<0.05$).
  • Figure 4: Scatter plots comparing expected accuracy of two prompt templates for all datasets and models. Color indicates product ensemble (token-level SMC) expected accuracy. Points with higher alpha indicate cases where the ensemble exceeds the best single template. Marker size scales with the improvement margin.
  • Figure 5: Particle study for JSON. We evaluate the approximation quality estimate $\mathbb{E}[\log \widehat{Z}]$ as a function of the number of particles being used for SMC.
  • ...and 5 more figures
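
The quantity $\log \widehat{Z}$ that appears in the captions above can be illustrated with a small token-level SMC sketch. This is not the paper's byte-level algorithm: the next-token tables `P1` and `P2`, the shared two-symbol vocabulary, the fixed length, and the choice to resample at every step are all assumptions made to keep the example short. The proposal is the locally normalized product, so each particle's incremental weight is the local normalizer at its prefix.

```python
import math
import random

VOCAB = ("a", "b")
LENGTH = 2

# Hypothetical next-token tables: prefix -> distribution over VOCAB.
P1 = {"": {"a": 0.9, "b": 0.1}, "a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.5, "b": 0.5}}
P2 = {"": {"a": 0.5, "b": 0.5}, "a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}


def smc_log_z(num_particles, rng):
    """Estimate log Z for the product target p1(x) * p2(x) and return particles."""
    particles = [""] * num_particles
    log_z = 0.0
    for _ in range(LENGTH):
        extended, weights = [], []
        for prefix in particles:
            scores = [P1[prefix][tok] * P2[prefix][tok] for tok in VOCAB]
            # Propose from the locally normalized product.
            tok = rng.choices(VOCAB, weights=scores)[0]
            extended.append(prefix + tok)
            # Incremental weight = local normalizer at this prefix.
            weights.append(sum(scores))
        # With resampling at every step, Z-hat is the product of mean weights.
        log_z += math.log(sum(weights) / num_particles)
        # Multinomial resampling proportional to the incremental weights.
        particles = rng.choices(extended, weights=weights, k=num_particles)
    return log_z, particles


rng = random.Random(0)
for n in (10, 100, 1000):
    estimate, _ = smc_log_z(n, rng)
    print(n, round(estimate, 4))

log_z_hat, samples = smc_log_z(20000, rng)
frac_b = sum(s.startswith("b") for s in samples) / len(samples)
```

For these tables the exact normalizer is $Z = 0.106$, and strings starting with `b` carry about 0.236 of the global product mass, so both the estimate and the resampled particles can be checked against enumeration.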

Theorems & Definitions (17)

  • Definition 4.1
  • Theorem 4.1: $\alpha$-Divergence Ensembles
  • proof
  • Definition 5.1
  • Proposition 5.1
  • proof
  • Proposition 5.2
  • Definition 3.1
  • Proposition 4.1
  • proof
  • ...and 7 more