SATER: A Self-Aware and Token-Efficient Approach to Routing and Cascading

Yuanzhe Shen, Yide Liu, Zisu Huang, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang

TL;DR

SATER addresses the efficiency bottleneck in routing between small and large language models by introducing a two-stage training approach: Stage I optimizes for shortest responses to cut redundant tokens, and Stage II enables confidence-based refusals to reduce wasted computation. The framework supports both pre-generation and cascade routing, delivering substantial improvements in cost and latency while maintaining or enhancing accuracy across diverse datasets and models. A robust evaluation framework with ToA/ToGR and AGL/AROL metrics demonstrates SATER’s ability to adapt to task difficulty and distribution shifts, highlighting the value of refusal-aware and vote-based aggregation strategies. Practically, SATER enables cost-effective deployment of mixed-model inference at scale, with guidelines on when to favor pre-generation versus cascade routing depending on cost ratios and task type.

Abstract

Large language models (LLMs) demonstrate remarkable performance across diverse tasks, yet their effectiveness frequently depends on costly commercial APIs or cloud services. Model selection thus entails a critical trade-off between performance and cost: high-performing LLMs typically incur substantial expenses, whereas budget-friendly small language models (SLMs) are constrained by limited capabilities. Current research primarily proposes two routing strategies: pre-generation routing and cascade routing. Both approaches have distinct characteristics, with cascade routing typically offering superior cost-effectiveness and accuracy despite its higher latency. To further address the limitations of both approaches, we introduce SATER, a dual-mode compatible approach that fine-tunes models through shortest-response preference optimization and a confidence-aware rejection mechanism. SATER significantly reduces redundant outputs and response times, while improving both the performance of pre-generation routing and the efficiency of cascade routing. Experiments across three SLMs and six datasets, varying in type and complexity, demonstrate that SATER achieves comparable performance while consistently reducing computational costs by over 50% and cascade latency by over 80%.
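
The two routing modes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of `None` as a refusal marker, the 0.6 threshold, and the plain (unweighted) majority vote are all assumptions made for illustration. The paper describes a weighted vote whose weights are not given in this summary.

```python
from collections import Counter
from typing import List, Optional, Tuple


def pre_generation_route(slm_response: Optional[str]) -> str:
    """Pre-generation mode: a refusal (None, a hypothetical marker) sends the query to the LLM."""
    return "llm" if slm_response is None else "slm"


def vote(samples: List[Optional[str]]) -> Tuple[Optional[str], float]:
    """Unweighted majority vote over sampled SLM answers (assumption: the paper's vote is weighted).

    Confidence is the share of ALL samples, refusals included, that agree
    with the most common answer, so refusals lower the confidence.
    """
    answered = [s for s in samples if s is not None]
    if not answered:
        return None, 0.0
    answer, count = Counter(answered).most_common(1)[0]
    return answer, count / len(samples)


def cascade_route(samples: List[Optional[str]], threshold: float = 0.6) -> str:
    """Cascade mode: keep the SLM answer only if vote confidence reaches the (hypothetical) threshold."""
    _, confidence = vote(samples)
    return "slm" if confidence >= threshold else "llm"
```

In this sketch a single refusal is enough to escalate in pre-generation mode, whereas cascade mode escalates only when agreement among the sampled answers falls below the threshold.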

Paper Structure

This paper contains 45 sections, 7 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Illustration of SATER. We train the SLM in two stages for optimal cost, accuracy, and latency. Stage I performs preference optimization with the shortest correct and longest incorrect responses, while Stage II employs prompt-based fine-tuning to teach the SLM to reject complex tasks. During inference, rejected queries are either routed directly to LLMs (pre-generation) or processed via weighted majority voting (cascade) for refined routing.
  • Figure 2: Introduction to Routing Strategies and Metrics. Strategy A routes the hardest questions (those beyond the LLM's capability) to the LLM first, while Strategy B routes only questions that the SLM cannot solve but the LLM can.
  • Figure 3: Average Cost-Accuracy Plot. Results are based on the average of six benchmarks. The top three curves represent pre-generation routing, while the bottom three display cascade routing (cost ratio: 1:13.75). The Average Cost-Accuracy(100) Plot and the individual results for each benchmark are presented in the appendix.
  • Figure 4: Comparison plot of cost-accuracy(100) between pre-generation and cascade routing, averaged across all benchmarks. Results are based on Qwen2.5-7B-Instruct, with three subplots depicting cost ratios of 1:25, 1:50, and 1:100 from left to right. Detailed results for individual benchmarks are available in the appendix.
  • Figure 5: Three examples from SATER. Responses are color-coded: red (incorrect), green (correct), blue (refused). White box shows the majority-voted answer, confidence score, and AGL or AROL, based on routing decisions.
  • ...and 8 more figures
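
The Stage I pairing described in the Figure 1 caption (shortest correct response preferred over longest incorrect response) might be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the function name and the `prompt`/`chosen`/`rejected` keys are hypothetical, and character length stands in for token count.

```python
from typing import Dict, List, Optional, Tuple


def build_preference_pair(
    prompt: str, responses: List[Tuple[str, bool]]
) -> Optional[Dict[str, str]]:
    """Build one preference pair from sampled (text, is_correct) responses.

    Assumption: length is measured in characters here; a real pipeline
    would more likely count tokens with the model's tokenizer.
    """
    correct = [text for text, ok in responses if ok]
    incorrect = [text for text, ok in responses if not ok]
    # A pair needs at least one response on each side.
    if not correct or not incorrect:
        return None
    return {
        "prompt": prompt,
        "chosen": min(correct, key=len),      # shortest correct response
        "rejected": max(incorrect, key=len),  # longest incorrect response
    }


samples = [
    ("The answer is 4.", True),
    ("4", True),
    ("Maybe 5, because adding gives one more.", False),
    ("5", False),
]
pair = build_preference_pair("What is 2 + 2?", samples)
```

Prompts whose samples are all correct or all incorrect yield no pair in this sketch; how the authors treat such prompts is not stated in this summary.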