Bridging the Gap between Different Vocabularies for LLM Ensemble

Yangyifan Xu; Jinliang Lu; Jiajun Zhang

TL;DR

EVA tackles the challenge of ensembling heterogeneous LLMs whose vocabularies differ by learning cross-model vocabulary alignments from overlapping tokens and projecting per-model output distributions into a shared space for fine-grained, token-level ensembling. It introduces a three-step noise reduction process and a filtering strategy to exclude unfaithful models, enabling robust combination at each generation step without task-specific fusion models. Empirical results across commonsense and arithmetic reasoning, machine translation, and data-to-text generation show EVA outperforms individual models and several baselines, with notable gains on GSM8K and broad cross-task improvements. The approach is model-agnostic, relies on a single projection matrix, and scales to diverse vocabularies, offering practical benefits for leveraging multiple LLMs in real-world settings.

Abstract

Ensembling different large language models (LLMs) to unleash their complementary potential and harness their individual strengths is highly valuable. Nevertheless, vocabulary discrepancies among various LLMs have constrained previous studies to either selecting or blending completely generated outputs. This limitation hinders the dynamic correction and enhancement of outputs during the generation process, resulting in a limited capacity for effective ensemble. To address this issue, we propose a novel method to Ensemble LLMs via Vocabulary Alignment (EVA). EVA bridges the lexical gap among various LLMs, enabling meticulous ensemble at each generation step. Specifically, we first learn mappings between the vocabularies of different LLMs with the assistance of overlapping tokens. Subsequently, these mappings are employed to project output distributions of LLMs into a unified space, facilitating a fine-grained ensemble. Finally, we design a filtering strategy to exclude models that generate unfaithful tokens. Experimental results on commonsense reasoning, arithmetic reasoning, machine translation, and data-to-text generation tasks demonstrate the superiority of our approach compared with individual LLMs and previous ensemble methods conducted on complete outputs. Further analyses confirm that our approach can leverage knowledge from different language models and yield consistent improvement.
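The projection-and-ensemble step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the learned vocabulary alignment has already been reduced to a row-stochastic matrix per model (`mapping[i][j]` is the weight of source token `i` on shared token `j`), and that the ensemble is a uniform average. The function names `project` and `ensemble_step` and the toy vocabularies are hypothetical, and the filtering of unfaithful models is omitted.

```python
def project(dist, mapping):
    """Project a distribution over a source vocabulary into the shared space.

    dist:    probabilities, one per source-vocabulary token
    mapping: mapping[i][j] = weight of source token i on shared token j;
             each row is assumed to sum to 1 (hypothetical format)
    """
    shared_size = len(mapping[0])
    out = [0.0] * shared_size
    for p, row in zip(dist, mapping):
        for j, w in enumerate(row):
            out[j] += p * w
    return out


def ensemble_step(dists, mappings):
    """Average the projected distributions for one generation step.

    Returns (index of the highest-probability shared token, averaged distribution).
    Uniform averaging is an assumption made for this sketch.
    """
    projected = [project(d, m) for d, m in zip(dists, mappings)]
    n = len(projected)
    avg = [sum(col) / n for col in zip(*projected)]
    best = max(range(len(avg)), key=avg.__getitem__)
    return best, avg


# Toy example: shared vocabulary is ["7", "8", "9"].
# Model A already uses the shared order; model B orders its tokens as ["8", "9", "7"].
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
b_to_shared = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]

dist_a = [0.5, 0.4, 0.1]  # A alone would pick "7"
dist_b = [0.7, 0.2, 0.1]  # B alone would pick its token 0, i.e. "8"

best, avg = ensemble_step([dist_a, dist_b], [identity, b_to_shared])
```

In this toy case the averaged distribution favours "8" even though model A alone would have emitted "7", which is the kind of per-token correction that ensembling over finished outputs cannot perform.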

Paper Structure (43 sections, 9 equations, 5 figures, 7 tables)

Introduction
Vocabulary Overlap Phenomenon
Impact of Vocabulary Distinction
Overlap between Vocabularies
Our Method
Cross-Model Vocabulary Alignment
Vocabulary Projection
Noise Reduction
Step-1: Top-$t$ Truncation.
Step-2: Threshold Truncation.
Step-3: Variance Truncation.
LLMs Ensemble
Experimental Settings
Datasets
Candidate LLMs
...and 28 more sections

Figures (5)

Figure 1: Motivation of EVA. For the problem of train travel distance, both TigerBot and ChatGLM provide wrong answers. Ensembling over completely generated outputs cannot derive the correct answer. EVA achieves correct answers by performing fine-grained ensemble at each generation step, allowing each token to benefit from the ensemble.
Figure 2: The EVA framework. EVA consists of two steps. (a) First, we establish alignment between the vocabularies of different models. (b) Next, we project the output distributions of different LLMs into a unified space using the established vocabulary alignment and exclude unfaithful tokens to perform fine-grained ensemble.
Figure 3: The rate of overlapping tokens between different LLMs' vocabularies. The models are arranged in ascending order based on vocabulary size. Each cell represents the proportion of shared tokens between the horizontal and vertical models, relative to the vocabulary size of the vertical model.
Figure 4: The average edit distance on the GSM8K (orange solid line) and Flores-Zh-En (green dotted line) tasks across various top-$n$ ranges. The average edit distance indicates the output token diversity.
Figure 5: Effect of number of ensemble models. The orange bars represent the performance of individual models, while the green line denotes the result of ensembling multiple models, denoted by their initials.
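
The overlap rate plotted in Figure 3 can be computed along the following lines. This is a sketch under stated assumptions rather than the authors' script: tokens are compared as exact strings with no normalization of tokenizer-specific whitespace markers, and the function name `overlap_rate` is hypothetical.

```python
def overlap_rate(horizontal_vocab, vertical_vocab):
    """Fraction of the vertical model's tokens that also occur in the horizontal model.

    Follows the Figure 3 caption: shared tokens divided by the vocabulary size
    of the vertical model. Exact string matching is an assumption of this sketch.
    """
    horizontal, vertical = set(horizontal_vocab), set(vertical_vocab)
    if not vertical:
        return 0.0
    return len(horizontal & vertical) / len(vertical)


# The measure is asymmetric: a small vocabulary can be fully covered by a large one,
# while the large one is only partly covered by the small one.
small = ["a", "b"]
large = ["a", "b", "c", "d"]
rate_small_covered = overlap_rate(large, small)  # relative to the small vocabulary
rate_large_covered = overlap_rate(small, large)  # relative to the large vocabulary
```

Because the denominator is the vertical model's vocabulary size, the resulting matrix is not symmetric, which is consistent with the caption's note that each cell is relative to the vertical model.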
