Hit the Sweet Spot! Span-Level Ensemble for Large Language Models

Yangyifan Xu, Jianghao Chen, Junhong Wu, Jiajun Zhang

TL;DR

SweetSpan tackles the challenge of ensembling diverse LLMs by introducing a span-level, training-free method that balances real-time adjustability with rich ensemble information. Each round, candidate models independently generate word-based spans from a shared prefix; perplexity-based evaluation with adaptive outlier filtering selects the best span to extend the prefix. The approach shows consistent improvements over strong baselines across multiple tasks, including a notable GSM8K gain, and demonstrates robustness in settings with large model performance gaps. By avoiding vocabulary- and data-exposure-related drawbacks of token- and sample-level ensembles, SweetSpan offers a versatile and scalable solution for ensemble reasoning in diverse language tasks.

Abstract

Ensembling various LLMs to unlock their complementary potential and leverage their individual strengths is highly valuable. Previous studies typically focus on two main paradigms: sample-level and token-level ensembles. Sample-level ensemble methods either select or blend fully generated outputs, which hinders dynamic correction and enhancement of outputs during the generation process. On the other hand, token-level ensemble methods enable real-time correction through fine-grained ensemble at each generation step. However, the information carried by an individual token is quite limited, leading to suboptimal decisions at each step. To address these issues, we propose SweetSpan, a span-level ensemble method that effectively balances the need for real-time adjustments and the information required for accurate ensemble decisions. Our approach involves two key steps: First, we have each candidate model independently generate candidate spans based on the shared prefix. Second, we calculate perplexity scores to facilitate mutual evaluation among the candidate models and achieve robust span selection by filtering out unfaithful scores. To comprehensively evaluate ensemble methods, we propose a new challenging setting (ensemble models with significant performance gaps) in addition to the standard setting (ensemble the best-performing models) to assess the performance of model ensembles in more realistic scenarios. Experimental results in both standard and challenging settings across various language generation tasks demonstrate the effectiveness, robustness, and versatility of our approach compared with previous ensemble methods.
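The two steps described above (independent span generation from a shared prefix, then perplexity-based mutual evaluation with filtering) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the callable interfaces for generators and scorers, and the median-absolute-deviation filter with threshold `k` are all assumptions made for illustration; the abstract does not specify the exact filtering rule or how the filtered scores are aggregated.

```python
import math
from statistics import median
from typing import Callable, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_outliers(scores: Sequence[float], k: float = 3.0) -> List[float]:
    """Keep scores within k median-absolute-deviations of the median.

    ASSUMPTION: this MAD rule is a stand-in for the paper's filtering of
    unfaithful scores, whose exact form is not given in the abstract.
    """
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    if mad == 0:  # scores (nearly) identical: nothing to filter
        return list(scores)
    return [s for s in scores if abs(s - med) <= k * mad]


def select_span(score_matrix: Sequence[Sequence[float]]) -> int:
    """Return the index of the span with the lowest mean filtered perplexity.

    score_matrix[i][j] is the perplexity that model j assigns to span i.
    ASSUMPTION: averaging the surviving scores is an illustrative choice.
    """
    best_index, best_score = 0, math.inf
    for i, row in enumerate(score_matrix):
        kept = filter_outliers(row)
        mean_score = sum(kept) / len(kept)
        if mean_score < best_score:
            best_index, best_score = i, mean_score
    return best_index


def ensemble_generate(
    prefix: str,
    generators: Sequence[Callable[[str], str]],
    scorers: Sequence[Callable[[str, str], float]],
    max_rounds: int,
) -> str:
    """Extend the shared prefix one selected span per round.

    generators[j](prefix) returns model j's candidate span (hypothetical
    interface); scorers[j](prefix, span) returns model j's perplexity of
    that span given the prefix (hypothetical interface).
    """
    for _ in range(max_rounds):
        spans = [generate(prefix) for generate in generators]
        scores = [[score(prefix, span) for score in scorers] for span in spans]
        chosen = spans[select_span(scores)]
        if not chosen:  # an empty span is treated as end of generation
            break
        prefix += chosen
    return prefix
```

In this sketch a single wildly different score (for example one model assigning perplexity 50.0 where the others assign about 2.0) is dropped before averaging, so one unreliable evaluator cannot veto an otherwise well-rated span. In a real setting the generators and scorers would wrap actual LLM calls, and the scorers would call `perplexity` on the token log-probabilities each model assigns to the span.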

Paper Structure (35 sections, 3 equations, 3 figures, 8 tables)

Figures (3)

  • Figure 1: Motivation of SweetSpan. Sample-level ensemble methods struggle to produce a correct answer when all candidate outputs are flawed, while token-level ensemble methods make suboptimal choices at each generation step due to inadequate information. SweetSpan balances the flexibility needed for real-time adjustments and the information required for accurate ensemble decisions at each step.
  • Figure 2: The SweetSpan framework. SweetSpan consists of two steps. (a) First, we have each candidate model generate a span based on the shared prefix. (b) Next, we facilitate mutual evaluation among the candidate models by calculating perplexity and achieve robust span selection by filtering out unfaithful scores.
  • Figure 3: Main results on various language generation tasks under the standard setting.