Token-level Ensembling of Models with Different Vocabularies

Rachel Wicks, Kartik Ravisankar, Xinchen Yang, Philipp Koehn, Matt Post

TL;DR

This work presents Agreement-Based Ensembling (ABE), an inference-time method to ensemble models with different vocabularies without extra training. By maintaining a shared detokenized global hypothesis and coordinating token selection through a cross-model search, ABE achieves token-level agreement across heterogeneous vocabularies and architectures, including encoder-decoder models and LLMs. Evaluated on machine translation across custom MT, public MT, and LLMs, ABE frequently yields improvements in BLEU and COMET over the best individual model and often surpasses interpolation baselines, though performance varies with model quality and language. The method is simple to implement, architecture-agnostic, and expands the applicability of ensembling to open-vocabulary settings, with potential to constrain hallucinations and guide future research on model pairings and decoding strategies.

Abstract

Model ensembling is a technique to combine the predicted distributions of two or more models, often leading to improved robustness and performance. For ensembling in text generation, the next token's probability distribution is derived from a weighted sum of the distributions of each individual model. This requires the underlying models to share the same subword vocabulary, limiting the applicability of ensembling, since many open-sourced models have distinct vocabularies. In research settings, experimentation or upgrades to vocabularies may introduce multiple vocabulary sizes. This paper proposes an inference-time only algorithm that allows for ensembling models with different vocabularies, without the need to learn additional parameters or alter the underlying models. Instead, the algorithm ensures that tokens generated by the ensembled models agree in their surface form. We apply this technique to combinations of traditional encoder-decoder models and decoder-only LLMs and evaluate on machine translation. In addition to expanding to model pairs that were previously incapable of token-level ensembling, our algorithm frequently improves translation performance over either model individually.
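
The shared-vocabulary requirement of standard ensembling can be seen in a minimal sketch of the weighted sum described above. The function name `interpolate`, the dictionary representation of a distribution, and the single `weight` parameter are illustrative assumptions, not taken from the paper.

```python
def interpolate(p1: dict, p2: dict, weight: float = 0.5) -> dict:
    """Weighted sum of two next-token distributions (illustrative helper).

    Both distributions map token string -> probability. The sum is only
    defined when the two models share exactly the same vocabulary.
    """
    if p1.keys() != p2.keys():
        # Different vocabularies: there is no one-to-one token alignment,
        # so a per-token weighted sum cannot be formed.
        raise ValueError("models must share the same vocabulary")
    return {tok: weight * p1[tok] + (1.0 - weight) * p2[tok] for tok in p1}


# Toy distributions over a shared two-token vocabulary (made-up numbers).
model_a = {"_the": 0.7, "_a": 0.3}
model_b = {"_the": 0.2, "_a": 0.8}
mixed = interpolate(model_a, model_b, weight=0.5)
```

With mismatched keys the helper raises an error, which is the situation the proposed algorithm is designed to handle.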

Paper Structure

This paper contains 24 sections, 2 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Agreement-Based Ensembling (ABE) enables ensembling among models with different vocabularies. Token generation for each beam item is constrained to tokens with agreeing detokenized forms.
  • Figure 2: A global state maintains the shared detokenized string, which is determined by the local hypotheses. Associated with each model is a flag denoting whether the model is stalled ($\times$) or able to generate (✓). In stalled steps (see the section on stalling), only the trailing model(s) generate(s) a token, catching up with the shared string. The stalled model is prevented from generating additional content.
  • Figure 3: The first 12 candidates in the ABE search space for unstalled $m_1$, $m_2$. Each model's vocabulary is sorted by score. The top left corner is pushed onto a heap with its weighted score, $0.58$. We present probabilities here for simplicity. In practice, each token score is the cumulative log-prob of the local hypothesis with this token as the extension. The loop then pops from the heap, checks for agreement, and adds unvisited neighbors onto the heap. Numbers denote visitation order.
  • Figure 4: Search space when $m_1$ is stalled. $m_1$ has generated "tokenization" while $m_2$ has only generated "_token iz". We present probabilities here for simplicity. In practice, each token score is the cumulative log-prob of the local hypothesis with this token as the extension.
  • Figure 5: $\Delta$COMET results on our custom English–German models using Agreement-Based Ensembling. $\Delta$COMET is the improvement of ensembling two models via ABE over the best individual model. Individual COMET scores displayed on axes. Labeling indicates vocab size followed by epoch checkpoint. All results on en-de WMT24.
  • ...and 5 more figures
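
The heap loop described in the Figure 3 caption can be sketched roughly as follows. This is a simplified illustration under several assumptions: the names `agrees` and `first_agreeing_pair` are hypothetical, agreement is approximated as one token string being a prefix of the other, scores are plain probabilities combined with equal weights, and the search stops at the first agreeing pair rather than filling a beam. The paper's actual implementation may differ on each of these points.

```python
import heapq


def agrees(a: str, b: str) -> bool:
    """Assumed agreement test: one surface form is a prefix of the other."""
    return a.startswith(b) or b.startswith(a)


def first_agreeing_pair(cands1, cands2, w1=0.5, w2=0.5):
    """Best-first search over the grid of candidate pairs.

    cands1 / cands2: lists of (token_string, score), each sorted by score
    in descending order. Returns (token1, token2, weighted_score) for the
    highest-scoring agreeing pair, or None if no pair agrees.
    """
    if not cands1 or not cands2:
        return None

    def score(i, j):
        return w1 * cands1[i][1] + w2 * cands2[j][1]

    # heapq is a min-heap, so scores are negated to pop the best pair first.
    heap = [(-score(0, 0), 0, 0)]  # start at the top left corner
    visited = {(0, 0)}
    while heap:
        neg_score, i, j = heapq.heappop(heap)
        tok1, tok2 = cands1[i][0], cands2[j][0]
        if agrees(tok1, tok2):
            return tok1, tok2, -neg_score
        # Push the unvisited neighbors: next token in either vocabulary.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(cands1) and nj < len(cands2) and (ni, nj) not in visited:
                visited.add((ni, nj))
                heapq.heappush(heap, (-score(ni, nj), ni, nj))
    return None


# Toy candidate lists with made-up probabilities.
m1 = [("_the", 0.6), ("_a", 0.3), ("_token", 0.1)]
m2 = [("_a", 0.5), ("_th", 0.4), ("_tok", 0.1)]
result = first_agreeing_pair(m1, m2)
```

Because both lists are sorted, every neighbor scores no higher than the cell it was reached from, so pairs are popped in non-increasing order of weighted score and the first agreeing pair found is the best one in the grid.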