Bridging the Gap between Different Vocabularies for LLM Ensemble
Yangyifan Xu, Jinliang Lu, Jiajun Zhang
TL;DR
EVA ensembles heterogeneous LLMs whose vocabularies differ. It learns cross-model vocabulary alignments from overlapping tokens and projects each model's output distribution into a shared space, which allows fine-grained, token-level ensembling. It introduces a three-step noise reduction process and a filtering strategy that excludes unfaithful models, so models can be combined robustly at each generation step without a task-specific fusion model. Across commonsense reasoning, arithmetic reasoning, machine translation, and data-to-text generation, EVA outperforms individual models and several baselines, with notable gains on GSM8K. The approach is model-agnostic, relies on a single projection matrix, and scales to diverse vocabularies, which makes it practical for combining multiple LLMs in real-world settings.
Abstract
Ensembling different large language models (LLMs) to unleash their complementary potential and harness their individual strengths is highly valuable. Nevertheless, vocabulary discrepancies among LLMs have constrained previous studies to either selecting or blending fully generated outputs. This limitation prevents outputs from being dynamically corrected and enhanced during generation, which limits how effective the ensemble can be. To address this issue, we propose a novel method to Ensemble LLMs via Vocabulary Alignment (EVA). EVA bridges the lexical gap among LLMs, enabling fine-grained ensembling at each generation step. Specifically, we first learn mappings between the vocabularies of different LLMs with the assistance of overlapping tokens. Subsequently, these mappings are employed to project the output distributions of the LLMs into a unified space, facilitating a fine-grained ensemble. Finally, we design a filtering strategy to exclude models that generate unfaithful tokens. Experimental results on commonsense reasoning, arithmetic reasoning, machine translation, and data-to-text generation tasks demonstrate the superiority of our approach over individual LLMs and over previous ensemble methods that operate on complete outputs. Further analyses confirm that our approach can leverage knowledge from different language models and yields consistent improvements.
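
The projection and ensembling steps described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the vocabularies, the probabilities, the hand-written alignment weights (standing in for a learned mapping), the uniform averaging, and the top-k filter, which is a simple stand-in and not necessarily the paper's filtering criterion.

```python
# Toy sketch of token-level ensembling across two vocabularies.
# All tokens, probabilities, and alignment weights are made up.

def project(dist, mapping, target_vocab):
    """Project a {token: prob} distribution into target_vocab.

    mapping[src_token] is a {target_token: weight} row that sums to 1,
    so the projected distribution still sums to 1.
    """
    out = {tok: 0.0 for tok in target_vocab}
    for src_tok, prob in dist.items():
        for tgt_tok, weight in mapping[src_tok].items():
            out[tgt_tok] += prob * weight
    return out

def ensemble(dists):
    """Uniformly average distributions that share one vocabulary."""
    return {tok: sum(d[tok] for d in dists) / len(dists) for tok in dists[0]}

def keep_faithful(pivot_dist, candidates, top_k=2):
    """Hypothetical filter: keep a projected distribution only if its
    most likely token is among the pivot model's top_k tokens."""
    top = set(sorted(pivot_dist, key=pivot_dist.get, reverse=True)[:top_k])
    return [d for d in candidates if max(d, key=d.get) in top]

# Pivot model vocabulary and its next-token distribution.
pivot_vocab = ["the", "cat", "cats", "s"]
pivot_dist = {"the": 0.1, "cat": 0.5, "cats": 0.3, "s": 0.1}

# Second model: "the" and "cat" overlap with the pivot vocabulary,
# "kitty" does not and is spread over similar pivot tokens.
other_dist = {"the": 0.2, "cat": 0.3, "kitty": 0.5}
mapping = {
    "the": {"the": 1.0},                  # overlapping token: identity
    "cat": {"cat": 1.0},                  # overlapping token: identity
    "kitty": {"cat": 0.7, "cats": 0.3},   # assumed alignment weights
}

projected = project(other_dist, mapping, pivot_vocab)

# A distribution whose top token disagrees with the pivot's top-2.
off_track = {"the": 0.0, "cat": 0.1, "cats": 0.1, "s": 0.8}

kept = keep_faithful(pivot_dist, [projected, off_track])
combined = ensemble([pivot_dist] + kept)
next_token = max(combined, key=combined.get)
print(next_token)  # prints "cat"
```

In this sketch the overlapping tokens map to themselves, and only the non-overlapping token needs alignment weights. In practice those weights would come from a learned mapping rather than being written by hand.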
