Semantic Agreement Enables Efficient Open-Ended LLM Cascades
Duncan Soiffer, Steven Kolawole, Virginia Smith
TL;DR
This work addresses the open problem of deciding when to defer in LLM cascades for open-ended generation by introducing semantic agreement, a training-free, meaning-level consensus signal. The system compares outputs from multiple small models using a range of semantic similarity metrics and defers to a larger target model when they disagree, achieving comparable or better quality at roughly 40% of the large-model cost, with latency reductions of up to 60%. The approach is black-box friendly, robust to model updates, and requires neither internal model access nor task-specific training, making it practical for real-world deployments. Across translation, summarization, and QA tasks, semantic cascades provide strong deferral signals and effective output selection. The work also discusses limitations and future directions, including combining semantic signals with token-level confidence to further improve deferral decisions.
Abstract
Cascade systems route computational requests to smaller models when possible and defer to larger models only when necessary, offering a promising approach to balance cost and quality in LLM deployment. However, they face a fundamental challenge in open-ended text generation: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose semantic agreement -- meaning-level consensus between ensemble outputs -- as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluating models from 500M to 70B parameters, we find that semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.
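
The deferral rule described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that agreement is the mean pairwise similarity across the small-model outputs, that a fixed threshold triggers deferral, and that the accepted output is the one most similar to the others. The names `bow_cosine`, `agreement_score`, `select_output`, and `cascade`, and the threshold value, are hypothetical. The bag-of-words cosine is only a crude stand-in for a real semantic similarity metric, which could be swapped in through the `sim` argument.

```python
from collections import Counter
from itertools import combinations
from math import sqrt
from typing import Callable, List, Tuple

Similarity = Callable[[str, str], float]


def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (assumed placeholder for a semantic metric)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def agreement_score(outputs: List[str], sim: Similarity = bow_cosine) -> float:
    """Mean pairwise similarity across the ensemble outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        # A single output gives no evidence of consensus, so treat it as no agreement.
        return 0.0
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def select_output(outputs: List[str], sim: Similarity = bow_cosine) -> str:
    """Return the output with the highest mean similarity to the other outputs."""
    def mean_sim(i: int) -> float:
        others = [o for j, o in enumerate(outputs) if j != i]
        return sum(sim(outputs[i], o) for o in others) / len(others) if others else 0.0

    best = max(range(len(outputs)), key=mean_sim)
    return outputs[best]


def cascade(
    prompt: str,
    small_outputs: List[str],
    large_model: Callable[[str], str],
    threshold: float = 0.7,  # assumed value; would be tuned per task
    sim: Similarity = bow_cosine,
) -> Tuple[str, bool]:
    """Return (answer, deferred). Defer to the large model when agreement is low."""
    if agreement_score(small_outputs, sim) >= threshold:
        return select_output(small_outputs, sim), False
    return large_model(prompt), True
```

Because the rule only consumes generated text, it needs no logits or other model internals, which is what makes this kind of cascade usable with black-box APIs. In this sketch the large model is called only on the deferral path, so its cost is paid only for prompts where the small models disagree.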
