From Deferral to Learning: Online In-Context Knowledge Distillation for LLM Cascades

Yu Wu, Shuo Wu, Ye Tao, Yansong Li, Anand D. Sarwate

TL;DR

Inter-Cascade tackles the memoryless deferral limitation of standard LLM cascades by enabling online, in-context knowledge distillation from a strong model to a weak one via a reusable strategy repository. It introduces similarity-based retrieval to augment the weak model's context without parameter updates and provides a theoretical calibration analysis showing improved confidence estimates. Empirically, it achieves notable gains in weak-model and overall pipeline accuracy while substantially reducing strong-model calls and costs across reasoning and knowledge benchmarks, using a model-agnostic, scalable approach suitable for both open-source and API-based LLMs. The framework also yields a potential source of data for offline fine-tuning and can extend to distributed settings, bridging online inference with long-term knowledge transfer.

Abstract

Standard LLM cascades improve efficiency by deferring difficult queries from weak to strong models. However, these systems are typically static: when faced with repeated or semantically similar queries, they redundantly consult the expensive model and fail to adapt during inference. To address this, we propose Inter-Cascade, an online, interactive framework that transforms the strong model from a temporary helper into a long-term teacher. In our approach, when the strong model resolves a deferred query, it generates a generalized, reusable problem-solving strategy. These strategies are stored in a dynamic repository and retrieved via similarity matching to augment the weak model's context for future queries. This enables the weak model to learn on the job without expensive parameter fine-tuning. We theoretically show that this mechanism improves the weak model's confidence calibration. Empirically, Inter-Cascade outperforms standard cascades on multiple benchmarks, improving weak-model and overall system accuracy by up to 33.06 percent and 6.35 percent, respectively, while reducing strong-model calls by up to 48.05 percent and costs by up to 49.63 percent. Inter-Cascade demonstrates effective in-context knowledge transfer between LLMs and provides a general, scalable framework applicable to both open-source and API-based LLMs.
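The loop described in the abstract (answer with the weak model, defer when confidence is low, store the strong model's strategy, retrieve it for later similar queries) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `StrategyRepository`, `inter_cascade`, `toy_weak`, and `toy_strong` are hypothetical, the token-overlap similarity stands in for whatever similarity matching the paper uses, and the fixed threshold of 0.8 is an arbitrary placeholder for a calibrated deferral threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Toy similarity: token-set overlap (assumed stand-in for embedding similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class StrategyRepository:
    """Hypothetical store of (query, strategy) pairs written by the strong model."""
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, query: str, strategy: str) -> None:
        self.entries.append((query, strategy))

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        # Return the strategies of the k stored queries most similar to `query`.
        ranked = sorted(self.entries, key=lambda e: jaccard(query, e[0]), reverse=True)
        return [strategy for _, strategy in ranked[:k]]


def inter_cascade(
    query: str,
    weak: Callable[[str, List[str]], Tuple[str, float]],
    strong: Callable[[str], Tuple[str, str]],
    repo: StrategyRepository,
    threshold: float = 0.8,  # assumed value, not taken from the paper
    k: int = 2,
) -> Tuple[str, str]:
    """Answer with the weak model when confident; otherwise defer and store a strategy."""
    strategies = repo.retrieve(query, k)
    answer, confidence = weak(query, strategies)  # strategies augment the weak model's context
    if confidence >= threshold:
        return answer, "weak"
    answer, strategy = strong(query)  # strong model answers and emits a reusable strategy
    repo.add(query, strategy)
    return answer, "strong"


def toy_weak(query: str, strategies: List[str]) -> Tuple[str, float]:
    # Toy weak model: confident only when at least one strategy is in context.
    return ("answer-with-strategy", 0.95) if strategies else ("guess", 0.3)


def toy_strong(query: str) -> Tuple[str, str]:
    # Toy strong model: returns an answer plus a strategy string.
    return "strong-answer", f"strategy for: {query}"


if __name__ == "__main__":
    repo = StrategyRepository()
    print(inter_cascade("add 2 and 3 apples", toy_weak, toy_strong, repo))
    print(inter_cascade("add 4 and 5 apples", toy_weak, toy_strong, repo))
```

In this toy run the first query is deferred to the strong model and the second, similar query is handled by the weak model using the stored strategy. No parameters are updated at any point; only the weak model's context changes.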

Paper Structure

This paper contains 30 sections, 4 theorems, 22 equations, 6 figures, 23 tables, 3 algorithms.

Key Result

Theorem 2.2

Suppose that $\widehat{R}^{+}(\lambda)$ is a monotonically decreasing function of $\lambda$. Fix $\delta\in(0,1)$ and an integer $n\ge 1$, and let $x\in\{0,1,\dots,n\}$, $\epsilon\in(0,1]$, and $b\in[1,\infty)$. Suppose that $\min\{\epsilon x+1,\,n-\epsilon x\}$ is moderately large and $1-\delta$ is not an

Figures (6)

  • Figure 1: (a) Pipeline of standard LLM cascade systems. (b) Pipeline of Inter-Cascade. The components unique to Inter-Cascade are shown in orange. For clarity and readability, we present only the two-LLM Inter-Cascade system; the parts that scale beyond two LLMs are rendered in a lighter color.
  • Figure 2: GSM-Symbolic dataset: (a) Accuracy as a function of the confidence threshold for the base Weak LLM, Inter-Cascade with random strategies, and Inter-Cascade with retrieval strategies, and (b) - (d) their corresponding confidence histograms. Our Inter-Cascade (Retrieval) consistently concentrates probability mass near high confidence ($0.9$–$1.0$), while the weak and random variants place more mass at low confidence, which explains the accuracy gains observed in (a).
  • Figure 3: Accuracy as a function of the confidence threshold for the base Weak LLM and for the Weak LLM within the Inter-Cascade using random and retrieval strategies across three benchmarks.
  • Figure 4: Confidence histograms for three benchmarks. Columns correspond to (a)(d)(g) the base Weak LLM, (b)(e)(h) the Weak LLM within the Inter-Cascade using random strategies, and (c)(f)(i) the Weak LLM within the Inter-Cascade using retrieval strategies. Across all datasets, the Inter-Cascade with retrieval strategies concentrates probability mass near high confidence (0.9–1.0), while the base and random-strategy variants place more mass at lower confidence levels.
  • Figure 5: The dynamics of pipeline accuracy for both Jung's method and our standard Inter-Cascade on GSM-Symbolic.
  • ...and 1 more figure

Theorems & Definitions (8)

  • Remark 2.1
  • Theorem 2.2
  • Lemma 4.1: Clopper–Pearson upper bound as a Beta quantile
  • proof
  • Theorem 5.1
  • proof
  • Theorem 6.1
  • proof
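
The title of Lemma 4.1 refers to a standard identity. For $k$ observed errors in $n$ trials, the one-sided Clopper–Pearson upper bound at level $1-\delta$ is the value $p$ solving $P(\mathrm{Bin}(n,p)\le k)=\delta$, which coincides with the $(1-\delta)$ quantile of a $\mathrm{Beta}(k+1,\,n-k)$ distribution. The lemma's exact statement is not reproduced in this listing, so the following is only a sketch of that textbook identity, solved by bisection with the standard library. The function names `binom_cdf` and `clopper_pearson_upper` are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k: int, n: int, delta: float, tol: float = 1e-12) -> float:
    """One-sided Clopper-Pearson upper bound: the p with P(Bin(n, p) <= k) = delta.

    Equivalently, the (1 - delta) quantile of Beta(k + 1, n - k) when k < n.
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if k == n:
        return 1.0  # the bound is trivially 1 when every trial is an error
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # The binomial CDF at k is decreasing in p, so a CDF above delta
        # means the root lies to the right of mid.
        if binom_cdf(k, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # With k = 0 the bound has the closed form 1 - delta**(1/n).
    print(clopper_pearson_upper(0, 10, 0.05), 1 - 0.05 ** (1 / 10))
```

Bisection is used here only to keep the sketch dependency-free; a Beta quantile routine from a numerical library would return the same value.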