SpecFuse: Ensembling Large Language Models via Next-Segment Prediction

Bo Lv, Chen Tang, Yanan Zhang, Xin Liu, Yue Yu, Ping Luo

TL;DR

SpecFuse has the LLMs in an ensemble collaborate on the next segment instead of fusing complete responses, which reduces first-token latency and removes the need for a trained fusion model. It runs an Inference–Verify loop: the base LLMs generate candidate segments in parallel, the same models rank the candidates through cross-model verification, and the top-ranked segment is fed back to all models for the next round. A Model Exit mechanism dynamically excludes underperforming models, using cumulative quality, entropy, and softmax temperature to balance accuracy against compute. Across six benchmarks and five 7–9B-class base LLMs, SpecFuse consistently improves performance on open-domain instruction-response and related tasks, while the exit mechanism reduces the number of model calls and the training-free design generalizes better to unseen queries.

Abstract

Ensembles of generative large language models (LLMs) can integrate the strengths of different LLMs to compensate for the limitations of individual models. However, recent work has focused on training an additional fusion model to combine complete responses from multiple LLMs, failing to tap into their collaborative potential to generate higher-quality responses. Moreover, as the additional fusion model is trained on a specialized dataset, these methods struggle with generalizing to open-domain queries from online users. In this paper, we propose SpecFuse, a novel ensemble framework that outputs the fused result by iteratively producing the next segment through collaboration among LLMs. This is achieved through cyclic execution of its inference and verification components. In each round, the inference component invokes each base LLM to generate candidate segments in parallel, and the verify component calls these LLMs again to predict the ranking of the segments. The top-ranked segment is then broadcast to all LLMs, encouraging them to generate higher-quality segments in the next round. This approach also allows the base LLMs to be plug-and-play, without any training or adaptation, avoiding generalization limitations. Furthermore, to conserve computational resources, we propose a model exit mechanism that dynamically excludes models exhibiting poor performance in previous rounds during each query response. In this way, it effectively reduces the number of model calls while maintaining overall performance.
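
The round structure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyLLM`, `verify`, `spec_fuse_loop`, and `scripted` are hypothetical names, the toy "models" are scripted word emitters rather than real LLMs, and ranking by the mean of per-model scores is an assumed stand-in for the paper's verify component.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyLLM:
    """Hypothetical stand-in for a base LLM (not an API from the paper)."""
    name: str
    generate: Callable[[str], str]      # context -> candidate next segment
    score: Callable[[str, str], float]  # (context, segment) -> higher is better


def verify(models: List[ToyLLM], context: str, candidates: List[str]) -> int:
    """Return the index of the candidate with the highest mean score (assumed rule)."""
    means = [
        sum(m.score(context, seg) for m in models) / len(models)
        for seg in candidates
    ]
    return max(range(len(candidates)), key=means.__getitem__)


def spec_fuse_loop(models: List[ToyLLM], prompt: str, max_rounds: int = 8) -> str:
    context = prompt
    for _ in range(max_rounds):
        # Inference: every model proposes a next segment
        # (sequential here; the paper runs this step in parallel).
        candidates = [m.generate(context) for m in models]
        # Verify: the same models rank the candidates.
        best = candidates[verify(models, context, candidates)]
        if not best:  # an empty winning segment ends generation
            break
        # Broadcast: the winner becomes part of every model's next input.
        context += best
    return context[len(prompt):]


def scripted(words: List[str]) -> Callable[[str], str]:
    """Toy generator: emits the k-th scripted word, k = words already in the answer."""
    def generate(context: str) -> str:
        k = len(context.split()) - 1  # the prompt "Q: " counts as one word
        return words[k] + " " if k < len(words) else ""
    return generate


def prefers_lowercase(context: str, segment: str) -> float:
    """Toy scorer shared by both models: lowercase segments are 'better'."""
    return 1.0 if segment.islower() else 0.0


models = [
    ToyLLM("a", scripted(["red", "GREEN", "blue"]), prefers_lowercase),
    ToyLLM("b", scripted(["RED", "green", "BLUE"]), prefers_lowercase),
]
answer = spec_fuse_loop(models, "Q: ")
print(repr(answer))  # 'red green blue '
```

In the toy run, the winning segment comes from model `a` in rounds one and three and from model `b` in round two, so the fused answer is one that neither model would have produced alone.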

Paper Structure

This paper contains 27 sections, 7 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Overview of Token Logits Fusion, Complete Response Fusion, and Our Segment Fusion methods.
  • Figure 2: An overview of SpecFuse, a novel ensemble framework, consisting of three parts: the Inference component, the Verify component, and the Model Exit mechanism. The blue solid line represents a single round of the process, while the dashed line shows the process of updating the models participating in the ensemble and refreshing the Input for the next round. In SpecFuse, the Inference component and Verify component synchronously update the model list after the Model Exit mechanism is executed. $\delta$ is the threshold, and when the probability drops below it, the model is excluded from the current generation process.
  • Figure 3: The variation in SpecFuse’s RougeL score as the number of base LLMs increases. +Qwen2-7B indicates adding a Qwen2-7B model to the ensemble.
  • Figure 4: The variation trends of SpecFuse's BertScore and RougeL score as the maximum generation length of each candidate segment changes.
  • Figure 5: For the test sets of the Open-Domain IR English and Chinese benchmarks, the percentage of ensemble iterations in which each model generates the best candidate segment.
  • ...and 1 more figure
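
The Figure 2 caption states that a model is excluded when its probability drops below the threshold $\delta$, and the TL;DR mentions cumulative quality, entropy, and softmax temperature. The sketch below shows one way such a rule could look, under loud assumptions: the function names are hypothetical, cumulative scores are turned into probabilities by a plain temperature softmax, and the role entropy plays in the paper is not reproduced (the temperature is simply passed in as a parameter).

```python
import math
from typing import Dict, List


def exit_probabilities(cumulative_scores: Dict[str, float],
                       temperature: float = 1.0) -> Dict[str, float]:
    """Temperature softmax over per-model cumulative scores (assumed form)."""
    top = max(cumulative_scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp((s - top) / temperature)
            for name, s in cumulative_scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def surviving_models(cumulative_scores: Dict[str, float],
                     delta: float,
                     temperature: float = 1.0) -> List[str]:
    """Keep the models whose probability is at least the threshold delta."""
    probs = exit_probabilities(cumulative_scores, temperature)
    return [name for name, p in probs.items() if p >= delta]


# Hypothetical cumulative scores after a few rounds.
scores = {"a": 3.0, "b": 2.5, "c": 0.0}
print(surviving_models(scores, delta=0.1))                   # ['a', 'b']
print(surviving_models(scores, delta=0.1, temperature=5.0))  # ['a', 'b', 'c']
```

A higher temperature flattens the distribution, so fewer models fall below $\delta$; a lower temperature sharpens it and drops weak models sooner. That is the accuracy-versus-compute trade-off the Model Exit mechanism is described as balancing.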