SCE: Scalable Consistency Ensembles Make Blackbox Large Language Model Generation More Reliable

Jiaxin Zhang, Zhuohang Li, Wendi Cui, Kamalika Das, Bradley Malin, Sricharan Kumar

TL;DR

The paper tackles the problem of unreliable LLM outputs by proposing Scalable Consistency Ensemble (SCE), which combines multiple blackbox LLMs to reduce hallucinations and improve robustness. It introduces SCE-CHECK for scalable semantic consistency evaluation and SCE-FUSION for generative summarization of the top-consistent candidates, augmented by YOPO to achieve constant-time consistency checks. Empirical results on classification and open-domain QA datasets show that SCE outperforms single models and existing baselines in truthfulness and consistency, while delivering substantial gains in efficiency (notably a two-order-of-magnitude speedup over traditional pairwise checks). The framework demonstrates how leveraging model complementarity and efficient prompting can yield reliable, scalable LLM deployments with practical impact for high-stakes applications.

Abstract

Large language models (LLMs) have demonstrated remarkable performance, yet their diverse strengths and weaknesses prevent any single LLM from achieving dominance across all tasks. Ensembling multiple LLMs is a promising approach to generating reliable responses, but conventional ensembling frameworks suffer from high computational overhead. This work introduces Scalable Consistency Ensemble (SCE), an efficient framework for ensembling LLMs by prompting consistent outputs. The SCE framework systematically evaluates and integrates outputs to produce a cohesive result through two core components: SCE-CHECK, a mechanism that gauges the consistency between response pairs via semantic equivalence; and SCE-FUSION, which merges the highest-ranked consistent responses from SCE-CHECK to optimize collective strengths and mitigate potential weaknesses. To improve scalability with multiple inference queries, we further propose "You Only Prompt Once" (YOPO), a novel technique that reduces the inference complexity of pairwise comparison from quadratic to constant time. We perform extensive empirical evaluations on diverse benchmark datasets to demonstrate SCE's effectiveness. Notably, the SCE-CHECK component outperforms conventional baselines with enhanced performance and a significant reduction in computational overhead.
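
The consistency-voting idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the string-matching `toy_equivalent` check (a stand-in for an LLM-based semantic equivalence judgment), and the wording of the single batched prompt are all assumptions made for illustration. The sketch shows why a one-prompt-per-pair checker needs n(n-1)/2 prompts for n responses, and how listing every pair in one prompt keeps the number of prompts at one.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def pairwise_prompt_count(n: int) -> int:
    """Prompts needed if every unordered response pair gets its own query."""
    return n * (n - 1) // 2


def toy_equivalent(a: str, b: str) -> bool:
    """Hypothetical stand-in for a semantic equivalence check.

    A real checker would ask an LLM; here two answers count as
    equivalent if they match after trimming and lowercasing.
    """
    return a.strip().lower() == b.strip().lower()


def consistency_votes(
    responses: Sequence[str],
    equivalent: Callable[[str, str], bool] = toy_equivalent,
) -> List[int]:
    """Count, for each response, how many other responses agree with it."""
    votes = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        if equivalent(responses[i], responses[j]):
            votes[i] += 1
            votes[j] += 1
    return votes


def top_consistent(
    responses: Sequence[str],
    k: int = 1,
    equivalent: Callable[[str, str], bool] = toy_equivalent,
) -> List[str]:
    """Return the k responses with the most agreement votes.

    Ties keep the original order because sorted() is stable.
    """
    votes = consistency_votes(responses, equivalent)
    ranked = sorted(range(len(responses)), key=lambda i: -votes[i])
    return [responses[i] for i in ranked[:k]]


def build_single_prompt(question: str, responses: Sequence[str]) -> str:
    """Assumed illustration of batching all pairs into one prompt.

    The template is hypothetical; the paper's actual YOPO prompt
    is not reproduced here.
    """
    lines = [f"Question: {question}", "For each pair, answer Yes or No:"]
    for idx, (i, j) in enumerate(combinations(range(len(responses)), 2), 1):
        lines.append(
            f"Pair {idx}: is '{responses[i]}' equivalent to '{responses[j]}'?"
        )
    return "\n".join(lines)
```

With the four candidate answers "Paris", "paris", "Lyon", and " Paris ", the three matching answers each receive two votes and "Lyon" receives none, so `top_consistent` returns the first "Paris". A pair-at-a-time checker would issue six prompts for these four answers, whereas `build_single_prompt` packs all six comparisons into one prompt string. The responses selected this way are the kind of input a fusion step could then summarize.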

Paper Structure

This paper contains 33 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Overview of the scalable consistency ensemble method, SCE, towards reliable response generation.
  • Figure 2: (Left) Truthfulness accuracy compared with a single fixed model; (Right) Computational cost of SCE-CHECK compared to the pairwise-prompt method.
  • Figure 3: The effect of sample size on the truthfulness accuracy (left) with the most consistency votes and falseness accuracy (right) with the least consistency votes.
  • Figure 4: Scalability analysis of consistency check.
  • Figure 5: Effect of sample size on the proportion of the most consistent votes in model ensembles across four datasets.
  • ...and 1 more figure