In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs

Vishnu Sarukkai; Asanshay Gupta; James Hong; Michaël Gharbi; Kayvon Fatahalian

TL;DR

The paper tackles the high inference cost of running LLM-based agents at scale by introducing in-context distillation, a training-free approach that guides a frozen student model with teacher demonstrations retrieved at each agent step. It couples this with self-consistency cascades, which decide when to defer to the teacher, yielding an adaptive, cost-efficient agent that maintains high task performance. Empirical results on ALFWorld and AppWorld show cost reductions of 2.5x and 2x respectively at equal or better accuracy, and demonstrate that the method generalizes to open-weight LLMs and offers a practical alternative to fine-tuning for rapid prototyping. The approach lowers deployment costs while preserving experimentation velocity, making advanced agentic systems more economically viable across diverse applications.
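The per-step control flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the token-overlap retriever, the prompt layout, the number of student samples, the agreement threshold, and all function names (`retrieve`, `build_prompt`, `cascade_step`) are hypothetical, and `student` and `teacher` stand in for calls to the low-cost and high-cost models.

```python
from collections import Counter
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (observation, teacher action) pair; hypothetical format


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for whatever retriever is actually used."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve(obs: str, demos: List[Demo], k: int = 2) -> List[Demo]:
    """Return the k stored teacher demonstrations most similar to the observation."""
    return sorted(demos, key=lambda d: jaccard(obs, d[0]), reverse=True)[:k]


def build_prompt(obs: str, examples: List[Demo]) -> str:
    """Place retrieved teacher demonstrations before the current observation."""
    shots = "\n".join(f"Observation: {o}\nAction: {a}" for o, a in examples)
    return f"{shots}\nObservation: {obs}\nAction:"


def cascade_step(
    obs: str,
    demos: List[Demo],
    student: Callable[[str], str],
    teacher: Callable[[str], str],
    n_samples: int = 3,      # assumed sample count
    min_agree: float = 1.0,  # assumed threshold: all samples must agree
) -> Tuple[str, str]:
    """Sample the student several times; defer to the teacher on disagreement."""
    prompt = build_prompt(obs, retrieve(obs, demos))
    votes = Counter(student(prompt) for _ in range(n_samples))
    action, count = votes.most_common(1)[0]
    if count / n_samples >= min_agree:
        return action, "student"
    # Assumption: the teacher sees only the raw observation.
    return teacher(obs), "teacher"
```

With a student that always returns the same action, `cascade_step` keeps the student's answer; when the sampled actions disagree, the step falls back to the teacher, which is where the higher cost is paid.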

Abstract

The world currently has an abundance of ideas for how to use new LLM agents, and developers seek to rapidly prototype and test new agentic designs. However, executing agents at scale using high-capacity LLMs incurs high inference costs. We propose a simple method for reducing LLM agent inference costs without incurring the development friction associated with LLM fine-tuning (long training cycles, optimization hyperparameter tweaking loops) or manual prompt engineering (laborious trial and error). Most importantly, we introduce $\textit{in-context distillation}$, which adapts the idea of knowledge distillation (training a low-cost student model to mimic a high-cost teacher) to an in-context learning setting. Our approach retrieves relevant teacher demonstrations at each agent step and provides them to the student as in-context examples, enabling the student to imitate teacher behavior on the fly. We combine in-context distillation with the established idea of $\textit{self-consistency cascades}$ to know when to trust the student. This adaptive strategy realizes the cost benefits of model specialization while preserving the productivity of working with frozen models. On the multi-step embodied reasoning benchmark ALFWorld, our method matches teacher-level accuracy at $\textbf{2.5$\times$ lower cost}$, reducing per-episode costs from \$0.059 to \$0.024. The upfront demonstration cost amortizes after just 843 episodes, yielding cumulative savings exceeding \$34,900 at deployment scale (1M episodes). On AppWorld, a complex agent benchmark requiring multi-step API workflows, we shift the Pareto frontier by achieving a $\textbf{2$\times$ cost reduction}$ at iso-accuracy. By reducing operational costs while maintaining rapid experimentation cycles with frozen models, our approach makes advanced agentic systems economically viable for a broader range of applications.
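The ALFWorld cost figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the abstract; the upfront demonstration cost is not given there, so `implied_upfront` is an inferred value that assumes break-even means "upfront cost divided by per-episode saving" and that the rounded per-episode costs are exact.

```python
# Figures quoted in the abstract (ALFWorld).
TEACHER_COST = 0.059          # dollars per episode, teacher only
METHOD_COST = 0.024           # dollars per episode, proposed method
BREAK_EVEN_EPISODES = 843     # episodes needed to amortize demonstrations
DEPLOY_EPISODES = 1_000_000   # deployment scale used in the abstract

# Per-episode saving and cost ratio (about 2.46x, reported as 2.5x).
saving_per_episode = TEACHER_COST - METHOD_COST
cost_ratio = TEACHER_COST / METHOD_COST

# Assumption: upfront cost = break-even episodes * per-episode saving.
implied_upfront = BREAK_EVEN_EPISODES * saving_per_episode

# Net cumulative savings at deployment scale, after the upfront cost.
net_savings = DEPLOY_EPISODES * saving_per_episode - implied_upfront

print(f"saving per episode: ${saving_per_episode:.3f}")
print(f"cost ratio: {cost_ratio:.2f}x")
print(f"implied upfront cost: ${implied_upfront:.2f}")
print(f"net savings at 1M episodes: ${net_savings:,.0f}")
```

Under these assumptions the net savings land just under \$35,000, consistent with the "exceeding \$34,900" figure in the abstract.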
