In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs

Vishnu Sarukkai, Asanshay Gupta, James Hong, Michaël Gharbi, Kayvon Fatahalian

TL;DR

The paper tackles the high inference costs of scaling LLM-based agents by introducing in-context distillation, a training-free approach that continuously guides a frozen student model with teacher demonstrations retrieved at each step. It couples this with self-consistency cascades to decide when to defer to the teacher, creating an adaptive, cost-efficient agent that maintains high task performance. Empirical results on ALFWorld and AppWorld show substantial cost reductions (2.5x and 2x respectively) at iso-accuracy or better, and demonstrate that the method generalizes to open-weight LLMs and offers a practical alternative to fine-tuning for rapid prototyping. The approach lowers deployment barriers while preserving experimentation velocity, making advanced agentic systems more economically viable across diverse applications.

Abstract

The world currently has an abundance of ideas for how to use new LLM agents, and developers seek to rapidly prototype and test new agentic designs. However, executing agents at scale using high-capacity LLMs incurs high inference costs. We propose a simple method for reducing LLM agent inference costs without incurring the development friction costs associated with LLM fine-tuning (long training cycles, optimization hyperparameter tweaking loops) or manual prompt engineering (laborious trial and error). Most importantly, we introduce $\textit{in-context distillation}$, which adapts the idea of knowledge distillation (training a low-cost student model to mimic a high-cost teacher) to an in-context learning setting. Our approach retrieves relevant teacher demonstrations at each agent step and provides them to the student as in-context examples, enabling the student to imitate teacher behavior on-the-fly. We combine in-context distillation with the established idea of $\textit{self-consistency cascades}$ to know when to trust the student. This adaptive strategy realizes the cost benefits of model specialization while preserving the productivity of working with frozen models. On the multi-step embodied reasoning benchmark ALFWorld, our method matches teacher-level accuracy at $\mathbf{2.5\times}$ lower cost, reducing per-episode costs from \$0.059 to \$0.024. The upfront demonstration cost amortizes after just 843 episodes, yielding cumulative savings exceeding \$34,900 at deployment scale (1M episodes). On AppWorld, a complex agent benchmark requiring multi-step API workflows, we shift the Pareto frontier by achieving a $\mathbf{2\times}$ cost reduction at iso-accuracy. By reducing operational costs while maintaining rapid experimentation cycles with frozen models, our approach makes advanced agentic systems economically viable for a broader range of applications.
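
The ALFWorld cost figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above (\$0.059, \$0.024, 843 episodes, 1M episodes). The upfront demonstration cost is not given in this text, so the script infers it from the break-even point; that inferred value and all function names are assumptions made for illustration, not figures from the paper.

```python
# Figures quoted in the abstract (ALFWorld), in USD.
TEACHER_COST_PER_EPISODE = 0.059
METHOD_COST_PER_EPISODE = 0.024
BREAK_EVEN_EPISODES = 843


def per_episode_saving() -> float:
    """Cost saved on each episode by running the method instead of the teacher."""
    return TEACHER_COST_PER_EPISODE - METHOD_COST_PER_EPISODE


def implied_upfront_cost() -> float:
    """ASSUMPTION: at break-even, upfront cost == per-episode saving * episodes.

    The abstract does not state the upfront cost; this value is inferred.
    """
    return per_episode_saving() * BREAK_EVEN_EPISODES


def cumulative_savings(episodes: int) -> float:
    """Net savings after `episodes` runs, once the upfront cost is paid off."""
    return per_episode_saving() * episodes - implied_upfront_cost()


if __name__ == "__main__":
    ratio = TEACHER_COST_PER_EPISODE / METHOD_COST_PER_EPISODE
    print(f"cost ratio: {ratio:.2f}x")
    print(f"implied upfront cost: ${implied_upfront_cost():.2f}")
    print(f"net savings at 1M episodes: ${cumulative_savings(1_000_000):,.2f}")
```

Under this reading, the per-episode saving is about \$0.035, and the net savings at one million episodes come out slightly above the \$34,900 quoted in the abstract, so the stated numbers are mutually consistent.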

Paper Structure

This paper contains 51 sections, 13 equations, 4 figures, 4 tables, 2 algorithms.

Figures (4)

  • Figure 1: Overview of the in-context distillation pipeline. 1) Demonstration collection phase: the teacher LLM creates a dataset of exemplars to be stored in a vector database. 2) Inference phase: at each decision-making step for the agent, the most relevant in-context examples are retrieved and injected into the student LLM's prompt. The student then produces multiple samples, which are evaluated for self-consistency. If the samples are inconsistent, the teacher is sampled instead.
  • Figure 2: Combining in-context learning with cascades optimizes cost-accuracy tradeoffs. Cost-accuracy tradeoff for a variety of different model selections and techniques. Accuracy numbers for the IC + Cascade experiments break the Pareto frontier defined by the rest of the examples, performing better than the teacher on ALFWorld and significantly above others at a similar cost on both domains.
  • Figure 3: Retrieving more in-context examples can boost task accuracy in exchange for higher costs. Cost-accuracy tradeoff for varying numbers of retrieved in-context exemplars ($k$, labeled on each datapoint) on ALFWorld and AppWorld (Student IC, no cascade). On ALFWorld, accuracy improves rapidly from $k{=}1$ to $k{=}4$, then exhibits diminishing returns beyond $k{=}6$. On AppWorld, accuracy peaks at $k{=}5$ with more modest overall gains. In this paper, we use $k{=}6$ on ALFWorld and $k{=}3$ on AppWorld by default.
  • Figure 4: Scaling teacher database size makes in-context distillation more effective. The cost-accuracy tradeoff for varying teacher database sizes (labeled on each datapoint) on ALFWorld and AppWorld (Student IC, no cascade). As we scale database size, more relevant examples are retrieved at each ReAct step, helping the agent solve tasks more successfully (increasing accuracy) and, as a corollary, more efficiently (reducing costs by shortening trajectories).
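
The inference phase described in the Figure 1 caption (retrieve exemplars, sample the student several times, fall back to the teacher on disagreement) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bag-of-words similarity stands in for a real vector database, the "all samples must agree" rule is one possible consistency test, and `embed`, `retrieve`, `cascade_step`, and the `student`/`teacher` callables are hypothetical names.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (teacher observation, teacher action)


def embed(text: str) -> Counter:
    """ASSUMPTION: toy bag-of-words embedding in place of a learned encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(demos: List[Demo], observation: str, k: int) -> List[Demo]:
    """Return the k teacher demonstrations most similar to the observation."""
    query = embed(observation)
    ranked = sorted(demos, key=lambda d: cosine(query, embed(d[0])), reverse=True)
    return ranked[:k]


def cascade_step(
    observation: str,
    demos: List[Demo],
    student: Callable[[str, List[Demo]], str],
    teacher: Callable[[str], str],
    k: int = 3,
    n_samples: int = 3,
) -> Tuple[str, str]:
    """One agent step: returns (action, which model produced it)."""
    examples = retrieve(demos, observation, k)
    # Sample the student several times with the retrieved exemplars in context.
    samples = [student(observation, examples) for _ in range(n_samples)]
    action, votes = Counter(samples).most_common(1)[0]
    # ASSUMPTION: trust the student only when every sample agrees.
    if votes == n_samples:
        return action, "student"
    # Otherwise defer to the more expensive teacher.
    return teacher(observation), "teacher"
```

In this sketch, `k` plays the role of the number of retrieved exemplars varied in Figure 3, and the size of `demos` corresponds to the teacher database size varied in Figure 4. A looser agreement threshold would send fewer steps to the teacher, at the risk of accepting more student errors.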