Cache & Distil: Optimising API Calls to Large Language Models

Guillem Ramírez, Matthias Lindemann, Alexandra Birch, Ivan Titov

TL;DR

Neural caching targets cost-efficient deployment of LLMs by training a compact student model on the LLM’s predictions and applying a policy to decide when to query the LLM. The authors formulate this as an online knowledge distillation problem with a budgeted querying process and evaluate classic active-learning criteria (notably Margin Sampling and Query by Committee) as policies. Across four classification datasets and varying budgets, Margin Sampling shows strong online performance without student retraining, while Query by Committee offers robust gains when the student is retrained periodically; embedding-based coresets underperform in this setting. The work demonstrates practical, cost-aware LLM utilization, introduces a benchmark with LLM annotations, and provides actionable insights for online distillation and policy design in real-world API-cost-constrained scenarios.

Abstract

Large-scale deployment of generative AI tools often depends on costly API calls to a Large Language Model (LLM) to fulfil user queries. To curtail the frequency of these calls, one can employ a smaller language model -- a student -- which is continuously trained on the responses of the LLM. This student gradually gains proficiency in independently handling an increasing number of user requests, a process we term neural caching. The crucial element in neural caching is a policy that decides which requests should be processed by the student alone and which should be redirected to the LLM, subsequently aiding the student's learning. In this study, we focus on classification tasks, and we consider a range of classic active learning-based selection criteria as the policy. Our experiments suggest that Margin Sampling and Query by Committee bring consistent benefits across tasks and budgets.
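As a rough illustration of the routing described in the abstract, the sketch below applies a Margin Sampling policy under a fixed LLM-call budget. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `student_predict`, `query_llm`, and `neural_cache`, and the fixed `threshold` rule for spending the budget, are hypothetical and do not come from the paper.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def margin(probs: Sequence[float]) -> float:
    """Gap between the two highest class probabilities (small gap = uncertain student)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def neural_cache(
    requests: Sequence[Hashable],
    student_predict: Callable[[Hashable], Sequence[float]],  # hypothetical: class probabilities
    query_llm: Callable[[Hashable], int],                    # hypothetical: costly LLM call
    budget: int,
    threshold: float,                                        # assumed fixed margin cut-off
) -> Tuple[List[int], List[Tuple[Hashable, int]]]:
    """Serve each request with the student, or with the LLM when the margin is low."""
    answers: List[int] = []
    llm_labelled: List[Tuple[Hashable, int]] = []  # stored to retrain the student later
    for x in requests:
        probs = student_predict(x)
        if budget > 0 and margin(probs) < threshold:
            label = query_llm(x)  # redirect the uncertain request to the LLM
            budget -= 1
            llm_labelled.append((x, label))
        else:
            # Trust the student: return its most probable class.
            label = max(range(len(probs)), key=lambda c: probs[c])
        answers.append(label)
    return answers, llm_labelled
```

The returned `llm_labelled` pairs are what a retraining step would consume; when and how often the student is retrained is left out of this sketch.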

Paper Structure (42 sections, 2 equations, 5 figures, 10 tables, 1 algorithm)

Figures (5)

  • Figure 1: Neural caching (one iteration): A student generates a response to a user request. The policy algorithm determines whether to rely on the student's response or to call an LLM. LLM responses are stored and used to re-train the student as more data becomes available.
  • Figure 2: Accuracy curve with respect to budgets for neural caching without student retraining.
  • Figure 3: Accuracy curve with respect to budgets, in the neural caching problem with student retraining. Error lines indicate variance. We have averaged results across the four datasets.
  • Figure 4: Accuracy curve with respect to budgets, in the neural caching problem with student retraining. Error lines indicate variance.
  • Figure 5: On the left, we order data points by their margin and plot the accuracy of their respective labels generated by the student and teacher. We observe that the greatest advantage of using the labels from the teacher comes with low margins. On the right, the accuracy of the labels generated by the LLM calls in neural caching with no student retraining. We observe that MS and QBC are more likely to generate wrong labels. We focus on Openbook for both plots.
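The left-hand analysis in Figure 5 (ordering data points by margin and comparing student and teacher label accuracy) could be reproduced along the following lines. This is a hedged sketch: the function name `accuracy_by_margin`, the equal-sized binning, and the input format are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def accuracy_by_margin(
    margins: Sequence[float],
    student_labels: Sequence[int],
    teacher_labels: Sequence[int],
    gold_labels: Sequence[int],
    n_bins: int = 4,  # assumed: equal-sized bins, lowest margins first
) -> List[Tuple[float, float]]:
    """Return (student accuracy, teacher accuracy) per bin, ordered by increasing margin."""
    order = sorted(range(len(margins)), key=lambda i: margins[i])
    rows: List[Tuple[float, float]] = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        end = (b + 1) * len(order) // n_bins
        idx = order[start:end]
        if not idx:
            continue  # skip empty bins when there are fewer points than bins
        student_acc = sum(student_labels[i] == gold_labels[i] for i in idx) / len(idx)
        teacher_acc = sum(teacher_labels[i] == gold_labels[i] for i in idx) / len(idx)
        rows.append((student_acc, teacher_acc))
    return rows
```

If the teacher's advantage is concentrated in the first (low-margin) bins, that matches the caption's observation that LLM labels help most where the student's margin is small.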