LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity

Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Ling Liu

TL;DR

The focal diversity metric is introduced to capture the diversity-performance correlation among component LLMs of an ensemble, and a diversity-optimized ensemble pruning algorithm is developed to select the top-k sub-ensembles from a pool of base LLMs.

Abstract

Combining large language models during training or at inference time has shown substantial performance gain over component LLMs. This paper presents LLM-TOPLA, a diversity-optimized LLM ensemble method with three unique properties: (i) We introduce the focal diversity metric to capture the diversity-performance correlation among component LLMs of an ensemble. (ii) We develop a diversity-optimized ensemble pruning algorithm to select the top-k sub-ensembles from a pool of $N$ base LLMs. Our pruning method recommends top-performing LLM sub-ensembles of size $S$, often much smaller than $N$. (iii) We generate new output for each prompt query by utilizing a learn-to-ensemble approach, which learns to detect and resolve the output inconsistency among all component LLMs of an ensemble. Extensive evaluation on four different benchmarks shows good performance gain over the best LLM ensemble methods: (i) In constrained solution set problems, LLM-TOPLA outperforms the best-performing ensemble (Mixtral) by 2.2% in accuracy on MMLU and the best-performing LLM ensemble (MoreAgent) on GSM8k by 2.1%. (ii) In generative tasks, LLM-TOPLA outperforms the top-2 performers (Llama70b/Mixtral) on SearchQA by $3.9\mathrm{x}$ in F1, and on XSum by more than $38$ in ROUGE-1. Our code and dataset, which contains outputs of 8 modern LLMs on 4 benchmarks, is available at https://github.com/git-disl/llm-topla
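
To make the pruning idea in property (ii) concrete, here is a minimal Python sketch of ranking every candidate sub-ensemble by a diversity score and keeping the top-k. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the focal diversity metric, so `stand_in_diversity` is a hypothetical substitute that asks, for each "focal" member, how often the other members are correct on the samples the focal member gets wrong. The function names and the `correct` data layout are likewise assumptions.

```python
from itertools import combinations


def stand_in_diversity(correct, team):
    """Hypothetical stand-in score, NOT the paper's focal diversity formula.

    correct[m][j] is True if model m answered validation sample j correctly.
    For each focal member, look only at the samples it fails and measure the
    fraction of (other member, failed sample) pairs that are correct.
    """
    scores = []
    for focal in team:
        others = [m for m in team if m != focal]
        failed = [j for j, ok in enumerate(correct[focal]) if not ok]
        if not failed:
            continue  # a focal model with no failures contributes no score
        rescued = sum(correct[m][j] for m in others for j in failed)
        scores.append(rescued / (len(others) * len(failed)))
    return sum(scores) / len(scores) if scores else 0.0


def top_k_subensembles(correct, k, min_size=2):
    """Enumerate all sub-ensembles of the pool and return the k most diverse."""
    models = sorted(correct)
    teams = [
        team
        for size in range(min_size, len(models) + 1)
        for team in combinations(models, size)
    ]
    # sorted() is stable, so ties keep their enumeration order.
    ranked = sorted(
        teams, key=lambda t: stand_in_diversity(correct, t), reverse=True
    )
    return ranked[:k]
```

Calling `top_k_subensembles(correct, k=2)` on a dictionary of per-model correctness lists returns the two highest-scoring teams as tuples of model names. Exhaustive enumeration grows exponentially with the pool size; it is practical here only because the pool is small (the abstract mentions 8 LLMs).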

Paper Structure (22 sections, 10 equations, 6 figures, 11 tables)

Figures (6)

  • Figure 1: We present the different types of tasks with their solution spaces.
  • Figure 2: An overview of TOPLA-Framework.
  • Figure 3: For each task, all candidate ensemble teams from the base model pools are plotted with their focal diversity scores and their performance metrics. The colors represent the size of each team, and the dotted line represents the best-performing individual model in the pool. We also plot the best-fit line with Pearson's Correlation Coefficient $\rho$ to show the correlation between performance and the focal diversity.
  • Figure 4: The effect of Focal-diversity Pruning is shown in the first two figures, and the effect of sliding window and selective global attention is shown in the third plot. Lastly, we show the effect of $K$ on TOPLA-Summary, and Weighted models in the GSM8k dataset.
  • Figure 5: The effect of training data size on performance.
  • ...and 1 more figure