CharED: Character-wise Ensemble Decoding for Large Language Models

Kevin Gu, Eva Tuecke, Dmitriy Katz, Raya Horesh, David Alvarez-Melis, Mikhail Yurochkin

TL;DR

CharED tackles ensembling large language models without shared vocabularies or fine-tuning by performing decode-time, character-level averaging of next-token predictions. By converting subword outputs to marginal next-character probabilities and forming a weighted joint distribution $J = \alpha P_1 + (1-\alpha) P_2$, CharED can combine models with different tokenizers in a vocabulary-agnostic manner. Theoretical results establish decoding equivalence for $\alpha = 1$ and tokenization invariance across tokenizers, while experiments across coding, math, and toxicity benchmarks show improved performance and robust transfer across model pairs without additional training. These findings suggest CharED as a practical alternative to fine-tuning for leveraging complementary strengths of diverse LLMs, with future work extending to more models and alternative averaging strategies such as geometric means.

Abstract

Large language models (LLMs) have shown remarkable potential for problem solving, with open-source models achieving increasingly impressive performance on benchmarks measuring areas from logical reasoning to mathematical ability. Ensembling models can further improve capabilities across a variety of domains. However, conventional methods of combining models at inference time, such as shallow fusion, necessitate a shared vocabulary and tokenization, and alternatives like fine-tuning for domain-specific performance are both time-consuming and computationally expensive. We therefore present an inference-time ensembling algorithm aimed at "averaging" outputs from multiple LLMs and illustrate its improved performance across multiple domains compared to its constituent models alone. Character-wise ensemble decoding, CharED, finds the marginal distribution of each character for an individual model and performs a weighted average to generate an output, character by character. In coding, math, and toxicity benchmarks, we find that our proposed method can combine complementary strengths of multiple LLMs, regardless of vocabulary, tokenization, or model size.
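
The character-level averaging step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `char_marginal` and `mix` and the toy next-token tables are assumptions made for illustration. Each token contributes its probability to its first character, and the two character distributions are then combined as $J = \alpha P_1 + (1-\alpha) P_2$.

```python
from collections import defaultdict


def char_marginal(token_probs):
    """Collapse a next-token distribution into a next-character distribution.

    token_probs maps token strings to probabilities; each token adds its
    probability mass to its first character.
    """
    marginal = defaultdict(float)
    for token, p in token_probs.items():
        if token:  # this simplified sketch ignores empty tokens
            marginal[token[0]] += p
    total = sum(marginal.values())
    return {c: p / total for c, p in marginal.items()}


def mix(p1, p2, alpha):
    """Weighted average J = alpha * P1 + (1 - alpha) * P2 over characters."""
    chars = set(p1) | set(p2)
    return {c: alpha * p1.get(c, 0.0) + (1 - alpha) * p2.get(c, 0.0) for c in chars}


# Hypothetical next-token outputs from two models with different vocabularies.
model_1_tokens = {"eight": 0.5, "e": 0.1, "twelve": 0.4}
model_2_tokens = {"ei": 0.2, "tw": 0.6, "twel": 0.2}

p1 = char_marginal(model_1_tokens)  # about {'e': 0.6, 't': 0.4}
p2 = char_marginal(model_2_tokens)  # about {'e': 0.2, 't': 0.8}
joint = mix(p1, p2, alpha=0.5)      # about {'e': 0.4, 't': 0.6}
```

Because the averaging happens over characters rather than tokens, the two vocabularies never need to be aligned; only the first characters of each model's candidate tokens matter at this step.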

Paper Structure

This paper contains 11 sections, 4 theorems, 2 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Theorem 2.1

Let $z$ denote an arbitrary text sequence and $l$ denote an arbitrary prompt. Then for $\alpha = 1$,

Figures (4)

  • Figure 1: Our CharED algorithm ensembles models character by character while decoding. Model prompt: "Sally has four hats, and John has twice as many. How many total hats are there?" Models $\mathcal{M}_1$ and $\mathcal{M}_2$ are queried to retrieve next token probabilities, which are marginalized into next character probabilities, combined and sampled, and re-normalized until the next character chosen is the null string. This sequence is then added to the existing answer, which is fed back into both models.
  • Figure 2: CharED combines complementary strengths of its constituent LLMs, outperforming each of these in aggregate terms. Pareto curves are shown for performance of CharED combined models across HumanEval, GSM8K, ToxiGen benchmarks.
  • Figure 3: Performance tradeoffs of combined models using CharED on different benchmarking tasks.
  • Figure 4: Summed performance across two benchmarks of combined models, using performance shown in Figure 3.
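
The loop described in the Figure 1 caption can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the toy lookup-table models, the helper names (`first_char_marginal`, `consume`, `decode`), greedy selection in place of sampling, and the choice to treat a model whose candidate tokens are exhausted as voting for the null string are all assumptions of this sketch.

```python
def first_char_marginal(token_probs):
    """Next-character distribution; a fully consumed token maps to "" (null string)."""
    if not token_probs:
        return {"": 1.0}  # assumption: an exhausted table votes to end the chunk
    marginal = {}
    for token, p in token_probs.items():
        marginal[token[:1]] = marginal.get(token[:1], 0.0) + p
    total = sum(marginal.values())
    return {c: p / total for c, p in marginal.items()}


def consume(token_probs, char):
    """Keep tokens that start with char, strip it, and re-normalize."""
    kept = {t[len(char):]: p for t, p in token_probs.items() if t.startswith(char)}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()} if kept else {}


def mix(p1, p2, alpha):
    """Weighted average J = alpha * P1 + (1 - alpha) * P2."""
    return {c: alpha * p1.get(c, 0.0) + (1 - alpha) * p2.get(c, 0.0)
            for c in set(p1) | set(p2)}


def decode(model_1, model_2, prompt, alpha, max_chars=32):
    answer = ""
    while len(answer) < max_chars:
        # Query both models on the prompt plus everything generated so far.
        t1, t2 = model_1(prompt + answer), model_2(prompt + answer)
        chunk = ""
        while True:
            joint = mix(first_char_marginal(t1), first_char_marginal(t2), alpha)
            char = max(joint, key=joint.get)  # greedy here; the caption samples
            if char == "":
                break  # null string chosen: hand the chunk back to both models
            chunk += char
            t1, t2 = consume(t1, char), consume(t2, char)
        if not chunk:
            break  # both models signal that nothing more should be generated
        answer += chunk
    return answer


# Hypothetical toy models: lookup tables from text to next-token probabilities.
def model_1(text):
    table = {"2+2=": {"four": 0.7, "five": 0.3}}
    return dict(table.get(text, {"": 1.0}))


def model_2(text):
    table = {"2+2=": {"fo": 0.4, "fi": 0.6},
             "2+2=fo": {"ur": 1.0},
             "2+2=fi": {"ve": 1.0}}
    return dict(table.get(text, {"": 1.0}))


print(decode(model_1, model_2, "2+2=", alpha=0.6))  # four
print(decode(model_1, model_2, "2+2=", alpha=0.2))  # five
```

With these toy tables, a larger weight on the first model yields its preferred answer and a smaller weight yields the second model's, even though the two models split the text into different tokens.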

Theorems & Definitions (6)

  • Theorem 2.1: Decoding Equivalence
  • Theorem 2.2: Tokenization Invariance
  • Theorem 1.1: restatement of the Decoding Equivalence theorem (label th:simple)
  • proof
  • Theorem 2.1: restatement of the Tokenization Invariance theorem (label th:invariance)
  • proof