From Deferral to Learning: Online In-Context Knowledge Distillation for LLM Cascades
Yu Wu, Shuo Wu, Ye Tao, Yansong Li, Anand D. Sarwate
TL;DR
Inter-Cascade addresses the memoryless deferral of standard LLM cascades by enabling online, in-context knowledge distillation from a strong model to a weak one through a reusable strategy repository. It uses similarity-based retrieval to augment the weak model's context without parameter updates, and it provides a theoretical analysis showing that this improves the weak model's confidence calibration. Empirically, across reasoning and knowledge benchmarks, it improves weak-model accuracy by up to 33.06 percent and overall pipeline accuracy by up to 6.35 percent, while reducing strong-model calls by up to 48.05 percent and cost by up to 49.63 percent. The approach is model-agnostic and scalable, and it applies to both open-source and API-based LLMs. The framework also yields a potential source of data for offline fine-tuning and can extend to distributed settings, bridging online inference with long-term knowledge transfer.
Abstract
Standard LLM cascades improve efficiency by deferring difficult queries from weak to strong models. However, these systems are typically static: when faced with repeated or semantically similar queries, they redundantly consult the expensive model and fail to adapt during inference. To address this, we propose Inter-Cascade, an online, interactive framework that transforms the strong model from a temporary helper into a long-term teacher. In our approach, when the strong model resolves a deferred query, it also generates a generalized, reusable problem-solving strategy. These strategies are stored in a dynamic repository and retrieved via similarity matching to augment the weak model's context for future queries. This enables the weak model to learn on the job without expensive parameter fine-tuning. We show theoretically that this mechanism improves the weak model's confidence calibration. Empirically, Inter-Cascade outperforms standard cascades on multiple benchmarks, improving weak-model and overall system accuracy by up to 33.06 percent and 6.35 percent, respectively, while reducing strong-model calls by up to 48.05 percent and cost by up to 49.63 percent. Inter-Cascade demonstrates effective in-context knowledge transfer between LLMs and provides a general, scalable framework applicable to both open-source and API-based LLMs.
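
The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `StrategyRepository`, `inter_cascade_step`, `toy_weak`, and `toy_strong` are hypothetical, the bag-of-words cosine similarity is a stand-in for whatever similarity matching the paper uses, and the confidence threshold and similarity cutoff are arbitrary illustrative values.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: a real system would use a sentence encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class StrategyRepository:
    """Hypothetical store of (past query, strategy) pairs written by the strong model."""
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, query: str, strategy: str) -> None:
        self.entries.append((query, strategy))

    def retrieve(self, query: str, k: int = 1, min_sim: float = 0.3) -> List[str]:
        """Return up to k strategies whose source query is similar enough to `query`."""
        q = embed(query)
        scored = [(cosine(q, embed(past)), strat) for past, strat in self.entries]
        scored = [s for s in scored if s[0] >= min_sim]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [strat for _, strat in scored[:k]]


# Assumed model interfaces (illustrative only):
#   weak(query, strategies) -> (answer, confidence in [0, 1])
#   strong(query)           -> (answer, reusable strategy text)
WeakModel = Callable[[str, List[str]], Tuple[str, float]]
StrongModel = Callable[[str], Tuple[str, str]]


def inter_cascade_step(query: str, weak: WeakModel, strong: StrongModel,
                       repo: StrategyRepository, threshold: float = 0.7) -> Tuple[str, str]:
    """Answer one query; defer to the strong model only if the weak model is unsure."""
    strategies = repo.retrieve(query)            # augment the weak model's context
    answer, confidence = weak(query, strategies)
    if confidence >= threshold:
        return answer, "weak"
    answer, strategy = strong(query)             # deferral
    repo.add(query, strategy)                    # keep the strategy for future queries
    return answer, "strong"


def toy_weak(query: str, strategies: List[str]) -> Tuple[str, float]:
    """Stub weak model: confident only when a retrieved strategy is in its context."""
    if strategies:
        return "answer using: " + strategies[0], 0.9
    return "guess", 0.2


def toy_strong(query: str) -> Tuple[str, str]:
    """Stub strong model: returns an answer plus a generalized strategy."""
    return "strong answer", "isolate the unknown, then divide both sides"
```

With these stubs, a first query such as "solve 3x + 2 = 11 for x" is deferred to the strong model and its strategy is stored; a later, similar query such as "solve 5x + 2 = 17 for x" retrieves that strategy and is handled by the weak model alone. No parameters are updated at any point, which is the sense in which the transfer is in-context.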
