CharED: Character-wise Ensemble Decoding for Large Language Models
Kevin Gu, Eva Tuecke, Dmitriy Katz, Raya Horesh, David Alvarez-Melis, Mikhail Yurochkin
TL;DR
CharED ensembles large language models that do not share a vocabulary, without any fine-tuning, by averaging their predictions character by character at decode time. Each model's subword (next-token) output is converted into a marginal next-character distribution, and the two distributions are combined into a weighted joint distribution $J = \alpha P_1 + (1-\alpha) P_2$, so models with different tokenizers can be combined in a vocabulary-agnostic manner. Theoretical results establish decoding equivalence for $\alpha = 1$ (where $J$ reduces to $P_1$) and invariance to the choice of tokenizer, while experiments on coding, math, and toxicity benchmarks show improved performance and robust transfer across model pairs without additional training. These findings suggest CharED as a practical alternative to fine-tuning for leveraging the complementary strengths of diverse LLMs, with future work extending to more than two models and to alternative averaging strategies such as geometric means.
Abstract
Large language models (LLMs) have shown remarkable potential for problem solving, with open-source models achieving increasingly impressive performance on benchmarks measuring areas from logical reasoning to mathematical ability. Ensembling models can further improve capabilities across a variety of domains. However, conventional methods of combining models at inference time, such as shallow fusion, necessitate a shared vocabulary and tokenization, and alternatives like fine-tuning for domain-specific performance are both time-consuming and computationally expensive. We therefore present an inference-time ensembling algorithm aimed at "averaging" outputs from multiple LLMs and illustrate its improved performance across multiple domains compared to its constituent models alone. Character-wise ensemble decoding, CharED, finds the marginal distribution of each character for an individual model and performs a weighted average to generate an output, character by character. In coding, math, and toxicity benchmarks, we find that the proposed method is able to combine the complementary strengths of multiple LLMs, regardless of vocabulary, tokenization, or model size.
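To make the character-wise averaging concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes each model's next-token distribution is available as a plain dictionary from token strings to probabilities; the function names `char_marginals` and `combine`, the `prefix` argument, and the example vocabularies are all hypothetical. It also omits the step of re-querying a model once one of its tokens has been fully emitted.

```python
from collections import defaultdict


def char_marginals(token_probs, prefix=""):
    """Marginal next-character distribution implied by a token distribution.

    token_probs: dict mapping token strings to probabilities (assumed input format).
    prefix: characters of the current token already emitted (hypothetical helper
            argument); tokens that do not start with it are discarded.
    """
    marginals = defaultdict(float)
    for token, prob in token_probs.items():
        # Keep only tokens consistent with the prefix that still have a next character.
        if token.startswith(prefix) and len(token) > len(prefix):
            marginals[token[len(prefix)]] += prob
    total = sum(marginals.values())
    if total == 0.0:
        return {}
    # Renormalize over the surviving tokens.
    return {ch: p / total for ch, p in marginals.items()}


def combine(p1, p2, alpha):
    """Weighted average J = alpha * P1 + (1 - alpha) * P2 over characters."""
    chars = set(p1) | set(p2)
    return {ch: alpha * p1.get(ch, 0.0) + (1 - alpha) * p2.get(ch, 0.0) for ch in chars}


# Two toy "models" with different vocabularies (made-up numbers).
model_1_tokens = {"the": 0.5, "to": 0.3, "a": 0.2}
model_2_tokens = {"t": 0.4, "an": 0.6}

p1 = char_marginals(model_1_tokens)  # mass on 't' comes from "the" and "to"
p2 = char_marginals(model_2_tokens)
joint = combine(p1, p2, alpha=0.5)
next_char = max(joint, key=joint.get)  # greedy choice of the next character
```

In this toy case both models place their mass on the characters `t` and `a` even though their vocabularies share no tokens, which is what allows the two distributions to be averaged. Setting `alpha=1.0` returns the first model's character marginals unchanged.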
