Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling

Yuxuan Yao; Han Wu; Mingyang Liu; Sichun Luo; Xiongwei Han; Jie Liu; Zhijiang Guo; Linqi Song

TL;DR

This paper investigates how to effectively ensemble large language models (LLMs) by identifying key determinants of success and proposing a practical method. It introduces a determine-then-ensemble strategy that prioritizes compatibility among models and a Union Top-$k$ Ensembling (UniTE) approach that avoids full vocabulary alignment by using the union of top-$k$ tokens across models. Empirical results across diverse benchmarks show that UniTE consistently improves performance, reduces token-level computation, and maintains low latency compared to existing methods, especially when base models have similar strengths. The work offers a robust framework for selecting compatible LLMs and efficiently combining them for improved generation quality.

Abstract

Large language models (LLMs) exhibit varying strengths and weaknesses across different tasks, prompting recent studies to explore the benefits of ensembling models to leverage their complementary advantages. However, existing LLM ensembling methods often overlook model compatibility and struggle with inefficient alignment of probabilities across the entire vocabulary. In this study, we empirically investigate the factors influencing ensemble performance, identifying model performance, vocabulary size, and response style as key determinants, revealing that compatibility among models is essential for effective ensembling. This analysis leads to the development of a simple yet effective model selection strategy that identifies compatible models. Additionally, we introduce Union Top-$k$ Ensembling (UniTE), a novel approach that efficiently combines models by focusing on the union of the top-$k$ tokens from each model, thereby avoiding the need for full vocabulary alignment and reducing computational overhead. Extensive evaluations across multiple benchmarks demonstrate that UniTE significantly enhances performance compared to existing methods, offering a more efficient framework for LLM ensembling.
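To make the top-$k$ union idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each model's next-token distribution is already available as a plain dictionary from token string to probability. The helper names (`top_k`, `union_top_k_ensemble`), the toy distributions, the choice to assign probability 0 to a token absent from a model's vocabulary, and the simple averaging rule are all illustrative assumptions; the paper's actual handling of such tokens may differ.

```python
from typing import Dict, List


def top_k(dist: Dict[str, float], k: int) -> Dict[str, float]:
    """Return the k most probable tokens of one model's next-token distribution."""
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def union_top_k_ensemble(dists: List[Dict[str, float]], k: int) -> Dict[str, float]:
    """Average model probabilities over the union of each model's top-k tokens."""
    # Candidate set: union of every model's top-k tokens, not the full vocabulary.
    candidates = set()
    for dist in dists:
        candidates.update(top_k(dist, k))

    # Assumption: a token missing from a model's vocabulary contributes 0.0.
    averaged = {
        tok: sum(dist.get(tok, 0.0) for dist in dists) / len(dists)
        for tok in candidates
    }

    # Renormalize so the ensembled scores form a distribution over the candidates.
    total = sum(averaged.values())
    return {tok: p / total for tok, p in averaged.items()}


# Toy distributions for two hypothetical models with different vocabularies.
model_a = {"Paris": 0.6, "London": 0.2, "Rome": 0.1, "Berlin": 0.1}
model_b = {"Paris": 0.5, "Lyon": 0.3, "London": 0.15, "Madrid": 0.05}

ensembled = union_top_k_ensemble([model_a, model_b], k=2)
next_token = max(ensembled, key=ensembled.get)
```

With $k = 2$, only three candidate tokens are scored instead of every entry in both vocabularies, which is the source of the reduced token-level computation the abstract describes.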

Paper Structure (29 sections, 6 figures, 13 tables, 1 algorithm)

Introduction
Related Works
Understanding Model Ensembling from Model Capacity, Vocabulary Size and Task
Impact of Model Performance Discrepancy
Influence of Vocabulary Size
Task-Specific Challenges in Model Ensembling
Methodology
Model selection strategy
Union top-$k$ ensembling
Experiments
Setup
Models
Baselines
Benchmarks
Main Results
...and 14 more sections

Figures (6)

Figure 1: The impact of performance disparity among models on ensemble effectiveness across different datasets and methods is examined. We compare these methods to the individual performances of LLaMA2 and Mistral, indicated by dashed lines.
Figure 2: The impact of performance differences on model ensembling effectiveness on the GSM8K dataset. OOM denotes an out-of-memory error.
Figure 3: Token distribution of different models. We extract the top 15 words from the vocabulary of each model and apply a softmax to their corresponding logits.
Figure 4: Impact of the choice of hyperparameter $k$ on the ARC and TriviaQA datasets. Increasing $k$ beyond a certain point leads to a slight decline or no improvement in performance.
Figure 5: Latency comparison of different methods. The darker the color, the greater the latency.
...and 1 more figure
