Poodle: Seamlessly Scaling Down Large Language Models with Just-in-Time Model Replacement

Nils Strassenburg; Boris Glavic; Tilmann Rabl

TL;DR

The paper presents Just-in-Time Model Replacement (JITR), a practical framework that detects recurring tasks in calls to costly large language models and hands them off to cheaper, task-specific surrogate models. It outlines a pipeline that runs from task identification and dataset generation through base-model search and surrogate development via distillation to continuous evaluation and monitoring, and it culminates in a working prototype called Poodle. Preliminary results show monetary and latency savings, with surrogates reaching competitive accuracy on sentiment-classification tasks and model search reducing development time and improving data efficiency. The work highlights core research questions around metadata quality, scalable model search, and the robust monitoring needed for real-world deployment and cost amortization.
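The cost-amortization point can be made concrete with a break-even calculation. The sketch below is illustrative only: the function name, its parameters, and every number are assumptions made for this example and are not taken from the paper. It assumes a one-time surrogate development cost and fixed per-call costs for both models.

```python
import math
from typing import Optional


def break_even_calls(
    development_cost: float,
    llm_cost_per_call: float,
    surrogate_cost_per_call: float,
) -> Optional[int]:
    """Return the number of calls after which a surrogate has paid for itself.

    All three inputs are hypothetical quantities in the same currency unit.
    Returns None when the surrogate is not cheaper per call, because the
    development cost can then never be recovered.
    """
    saving_per_call = llm_cost_per_call - surrogate_cost_per_call
    if saving_per_call <= 0:
        return None
    # Round up: a fraction of a call cannot be served.
    return math.ceil(development_cost / saving_per_call)


if __name__ == "__main__":
    # Made-up figures: 10.0 to build the surrogate, 0.5 per LLM call,
    # 0.25 per surrogate call.
    print(break_even_calls(10.0, 0.5, 0.25))
```

Under this simple model, replacement only pays off for tasks that recur more often than the break-even count, which is one reason task identification and monitoring matter.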

Abstract

Businesses increasingly rely on large language models (LLMs) to automate simple repetitive tasks instead of developing custom machine learning models. LLMs require few, if any, training examples and can be utilized by users without expertise in model development. However, this comes at the cost of substantially higher resource and energy consumption compared to smaller models, which often achieve similar predictive performance for simple tasks. In this paper, we present our vision for just-in-time model replacement (JITR), where, upon identifying a recurring task in calls to an LLM, the model is replaced transparently with a cheaper alternative that performs well for this specific task. JITR retains the ease of use and low development effort of LLMs, while saving significant cost and energy. We discuss the main challenges in realizing our vision regarding the identification of recurring tasks and the creation of a custom model. Specifically, we argue that model search and transfer learning will play a crucial role in JITR to efficiently identify and fine-tune models for a recurring task. Using our JITR prototype Poodle, we achieve significant savings for exemplary tasks.
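The idea of transparent replacement can be illustrated with a minimal routing sketch. This is not Poodle's implementation: the class `JitrRouter`, the helper `task_signature`, the colon-based prompt convention, and the call-count threshold are all hypothetical choices made for this example. The sketch assumes that each prompt has the form "instruction: input" and treats the normalized instruction as the task identity, which is a deliberately crude stand-in for real task identification.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical type: any function mapping a prompt string to a response string.
Model = Callable[[str], str]


def task_signature(prompt: str) -> str:
    """Treat the text before the first colon as the task instruction.

    Assumption for this sketch only; whitespace and case are normalized so
    that trivially different phrasings map to the same signature.
    """
    instruction = prompt.split(":", 1)[0]
    return re.sub(r"\s+", " ", instruction).strip().lower()


class JitrRouter:
    """Send calls to the LLM until a surrogate is registered for the task."""

    def __init__(self, llm: Model, threshold: int = 3) -> None:
        self.llm = llm
        self.threshold = threshold  # calls before a task counts as recurring
        self.counts: Counter = Counter()
        self.surrogates: Dict[str, Model] = {}
        # Logged (prompt, response) pairs could later serve as training data.
        self.logged: Dict[str, List[Tuple[str, str]]] = {}

    def register_surrogate(self, signature: str, model: Model) -> None:
        self.surrogates[signature] = model

    def recurring_tasks(self) -> List[str]:
        """Signatures seen at least `threshold` times without a surrogate."""
        return [
            sig
            for sig, n in self.counts.items()
            if n >= self.threshold and sig not in self.surrogates
        ]

    def __call__(self, prompt: str) -> str:
        sig = task_signature(prompt)
        self.counts[sig] += 1
        surrogate = self.surrogates.get(sig)
        if surrogate is not None:
            # The caller uses the same interface; only the backend changes.
            return surrogate(prompt)
        response = self.llm(prompt)
        self.logged.setdefault(sig, []).append((prompt, response))
        return response
```

A caller keeps invoking `router(prompt)` throughout. Once `recurring_tasks()` reports a signature, a smaller model can be trained offline on `router.logged[signature]` and attached with `register_surrogate`, after which matching calls no longer reach the LLM.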
