Semantic Agreement Enables Efficient Open-Ended LLM Cascades

Duncan Soiffer, Steven Kolawole, Virginia Smith

TL;DR

This work tackles the open challenge of deferring open-ended generation in cascaded LLMs by introducing semantic agreement as a training-free, meaning-level consensus signal. By comparing outputs from multiple small models using a range of semantic similarity metrics, the system decides when to defer to a larger target model, achieving comparable or better quality at roughly 40% of the large-model cost and with substantial latency reductions. The approach is black-box friendly, robust to model updates, and does not require internal model access or task-specific training, making it practical for real-world deployments. Across translation, summarization, and QA tasks, semantic cascades demonstrate strong deferral signals and effective output selection, highlighting a promising direction for efficient, scalable LLM systems. The work also discusses limitations and potential future directions, including combining semantic signals with token-level confidence to further improve deferral decisions.

Abstract

Cascade systems route computational requests to smaller models when possible and defer to larger models only when necessary, offering a promising approach to balance cost and quality in LLM deployment. However, they face a fundamental challenge in open-ended text generation: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose semantic agreement -- meaning-level consensus between ensemble outputs -- as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluated from 500M to 70B-parameter models, we find that semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.
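
The deferral rule described above can be illustrated with a minimal sketch. It is not the authors' implementation. Every name in it (`jaccard_similarity`, `agreement_scores`, `cascade`, `call_large_model`) and the 0.5 default threshold are hypothetical. The word-overlap similarity is only a dependency-free stand-in for a learned semantic metric such as BLEURT. Scoring each output by its mean similarity to the other ensemble outputs is likewise an assumption about how the "highest-scoring" response might be chosen.

```python
from typing import Callable, List, Sequence, Tuple


def jaccard_similarity(a: str, b: str) -> float:
    """Word-set overlap in [0, 1].

    Assumption: a cheap placeholder for a real semantic similarity metric.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def agreement_scores(
    outputs: Sequence[str], sim: Callable[[str, str], float]
) -> List[float]:
    """Score each output by its mean similarity to the other outputs."""
    n = len(outputs)
    scores = []
    for i in range(n):
        others = [sim(outputs[i], outputs[j]) for j in range(n) if j != i]
        scores.append(sum(others) / len(others))
    return scores


def cascade(
    outputs: Sequence[str],
    call_large_model: Callable[[], str],
    threshold: float = 0.5,  # hypothetical value; would be tuned per task
    sim: Callable[[str, str], float] = jaccard_similarity,
) -> Tuple[str, bool]:
    """Return (response, deferred) for one query.

    `outputs` holds the small-model responses. The large model is only
    called when ensemble agreement falls below `threshold`.
    """
    if len(outputs) < 2:
        raise ValueError("agreement needs at least two ensemble outputs")
    scores = agreement_scores(outputs, sim)
    ensemble_agreement = sum(scores) / len(scores)
    if ensemble_agreement >= threshold:
        # High agreement: keep the ensemble output closest to the others.
        best = max(range(len(outputs)), key=scores.__getitem__)
        return outputs[best], False
    # Low agreement: pay for the large target model.
    return call_large_model(), True
```

Passing the large model as a callable keeps the expensive call lazy, so it is made only on the deferral branch. Because the sketch uses nothing but output strings, the same logic applies to black-box APIs.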

Paper Structure

This paper contains 60 sections, 7 figures, 63 tables.

Figures (7)

  • Figure 1: Overview of the semantic cascade framework. An ensemble of small models generates responses to a user query. Semantic similarity metrics assess agreement among outputs. High semantic agreement indicates reliability: the system returns the highest-scoring ensemble response. Low agreement signals uncertainty: the query defers to a larger target model, ensuring that the expensive model is called only when necessary.
  • Figure 2: Semantic deferrals achieve superior efficiency-quality tradeoffs across diverse generation tasks. On SQuAD, CNN/DM, and WMT FR$\to$EN, semantic methods consistently outperform token-level confidence across all quality and latency levels and even surpass the target model in some cases. These results suggest semantic agreement captures dimensions of output reliability that token-level confidence misses. The curves shown correspond to the cascades from \ref{tab:efficiency_results}.
  • Figure 3: Semantic cascades select stronger outputs than their constituent models and make more effective deferral decisions than larger token-level cascades. (a) Deferral curves for a semantic cascade of [Qwen2.5-1.5B, Gemma3-1B, FLAN-T5-Large], token-level cascades of its individual ensemble models, and a token-level cascade of Qwen2.5-3B, evaluated on WMT DE$\to$FR. (b) The same setup for a semantic cascade of [Llama3.2-3B, Qwen2.5-3B, FLAN-T5-Large] and the larger token-level model Llama3.1-8B, evaluated on SQuAD. All cascades defer to Llama3.1-70B; for each cascade, only the curve with the highest AUC is shown. In both cases, the semantic ensemble has a lower baseline than the larger token-level model but overtakes it due to superior deferral decisions. Additionally, the semantic ensemble considerably outperforms its constituent models across all deferral rates. This demonstrates that semantic similarity's advantage comes not just from reliably selecting strong responses, but also from genuinely superior deferral decisions.
  • Figure 4: (a) Comparison of deferral curves for different BLEURT sizes for a semantic cascade of Qwen2.5-1.5B, Gemma3-1B, and mT0-Large, deferring to the large model Llama3.1-8B. Smaller BLEURT sizes lead to worse performance. (b) A similar comparison on SQuAD1.1 for the same semantic ensemble, deferring to Qwen2.5-7B. Smaller BLEURT sizes do not affect cascade performance here, because responses are short and the evaluation scheme is binary.
  • Figure 5: Deferral curves on WMT19 DE$\to$FR, WMT14 FR$\to$EN, and WMT14 EN$\to$FR for the simplest semantic ensemble, Qwen2.5-1.5B and Qwen2.5-0.5B, which always uses the output of Qwen2.5-1.5B when not deferring, plotted alongside deferral curves for token-level Qwen2.5-1.5B. Both cascades defer to Qwen2.5-7B. In all cases, the semantic cascade improves over single-model token-level deferral signals: on DE$\to$FR, for instance, its AUC is .6307, compared with .6263 for the best token-level signal (Chow-Avg), and it improves significantly over random deferral (AUC: .6130). This demonstrates that, even in a very simple ensemble, semantic similarity with a substantially weaker model can still provide a strong indication of when deferral is appropriate.
  • ...and 2 more figures
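
The deferral curves and AUC values cited in the captions above can be illustrated with a small sketch. It rests on the following assumptions: queries are deferred in order of increasing agreement, quality is averaged per query, and AUC is the trapezoidal area under the (deferral rate, quality) points. The paper's exact computation is not given in this excerpt, and the function names `deferral_curve` and `auc` are hypothetical.

```python
from typing import List, Sequence, Tuple


def deferral_curve(
    agreement: Sequence[float],
    small_quality: Sequence[float],
    large_quality: Sequence[float],
) -> List[Tuple[float, float]]:
    """Mean quality at each deferral rate, deferring low-agreement queries first.

    Each argument has one entry per query: the ensemble agreement score,
    the quality of the small-model answer, and the quality of the
    large-model answer.
    """
    n = len(agreement)
    order = sorted(range(n), key=lambda i: agreement[i])  # least agreement first
    total = float(sum(small_quality))
    curve = [(0.0, total / n)]  # nothing deferred yet
    for k, i in enumerate(order, start=1):
        # Swap query i's small-model answer for the large-model answer.
        total += large_quality[i] - small_quality[i]
        curve.append((k / n, total / n))
    return curve


def auc(curve: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under the (deferral rate, quality) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(curve, curve[1:])
    )
```

Under this definition, a signal that ranks the small model's failures lowest raises quality early in the curve and therefore yields a higher AUC than a signal that ranks queries arbitrarily, which is the comparison the captions make against random deferral.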