Online Cascade Learning for Efficient Inference over Streams

Lunyiu Nie, Zhimin Ding, Erdong Hu, Christopher Jermaine, Swarat Chaudhuri

TL;DR

Experimental results show that the proposed online cascade learning method parallels LLMs in accuracy while cutting down inference costs by as much as 90% with strong robustness against input distribution shifts, underscoring its efficacy and adaptability in stream processing.

Abstract

Large Language Models (LLMs) have a natural role in answering complex queries about data streams, but the high computational cost of LLM inference makes them infeasible in many such tasks. We propose online cascade learning, the first approach to address this challenge. The objective here is to learn a "cascade" of models, starting with lower-capacity models (such as logistic regression) and ending with a powerful LLM, along with a deferral policy that determines the model to be used on a given input. We formulate the task of learning cascades online as an imitation-learning problem, where smaller models are updated over time imitating the collected LLM demonstrations, and give a no-regret algorithm for the problem. Experimental results across four benchmarks show that our method parallels LLMs in accuracy while cutting down inference costs by as much as 90% with strong robustness against input distribution shifts, underscoring its efficacy and adaptability in stream processing.
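The cascade idea in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: the deferral policy is reduced to a fixed confidence threshold per model (the paper learns the policy online), the small model is a hand-written online logistic regression, the LLM is any callable returning a 0/1 label, and the names `OnlineLogReg`, `run_cascade`, `thresholds`, and `llm_cost` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


class OnlineLogReg:
    """Tiny binary logistic regression trained by SGD (stand-in for a small cascade model)."""

    def __init__(self, dim: int, lr: float = 0.5, cost: float = 1.0):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr
        self.cost = cost  # assumed per-query inference cost

    def prob(self, x: Sequence[float]) -> float:
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x: Sequence[float]) -> Tuple[int, float]:
        p = self.prob(x)
        return (1 if p >= 0.5 else 0), max(p, 1.0 - p)

    def update(self, x: Sequence[float], y: int) -> None:
        # One SGD step on the log loss, imitating the label y.
        g = self.prob(x) - y
        self.w = [wi - self.lr * g * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * g


def run_cascade(
    stream: Sequence[Sequence[float]],
    small_models: List[OnlineLogReg],
    thresholds: List[float],
    llm: Callable[[Sequence[float]], int],
    llm_cost: float = 100.0,
) -> Tuple[List[int], float]:
    """Process a stream; defer to the next model whenever confidence is below its threshold."""
    outputs: List[int] = []
    total_cost = 0.0
    for x in stream:
        answered = False
        for model, tau in zip(small_models, thresholds):
            total_cost += model.cost
            label, conf = model.predict(x)
            if conf >= tau:  # assumption: fixed-threshold deferral rule
                outputs.append(label)
                answered = True
                break
        if not answered:
            # The cascade reached the LLM: use its answer and treat it as a
            # demonstration for updating every smaller model.
            total_cost += llm_cost
            y_llm = llm(x)
            outputs.append(y_llm)
            for model in small_models:
                model.update(x, y_llm)
    return outputs, total_cost
```

On an easy stream, the small model defers the first few inputs to the LLM, learns from those labels, and then answers most later inputs itself, which is where the cost saving comes from in this toy setting.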

Paper Structure

This paper contains 46 sections, 5 theorems, 28 equations, 11 figures, 5 tables, 1 algorithm.

Key Result

Theorem 3.1

With online gradient descent and a learning rate $\eta_t = t^{-1/2}$, the total regret $\gamma$ of the online ensemble learning algorithm is bounded as follows: Therefore, $\lim_{T\rightarrow \infty} \gamma/T \leq 0$.
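To make the no-regret statement concrete, the following is a small sketch of projected online gradient descent with the step size $\eta_t = t^{-1/2}$ on a toy one-dimensional problem. It is not the paper's algorithm or its bound: the squared loss, the interval `[-radius, radius]`, and the function name `ogd_average_regret` are assumptions chosen only to show the average regret $\gamma/T$ becoming small as $T$ grows.

```python
from typing import Sequence


def ogd_average_regret(targets: Sequence[float], radius: float = 1.0) -> float:
    """Run projected OGD on losses f_t(w) = (w - z_t)^2 and return regret / T.

    Regret is measured against the best fixed point in [-radius, radius]
    chosen in hindsight.
    """
    w = 0.0
    cumulative_loss = 0.0
    for t, z in enumerate(targets, start=1):
        cumulative_loss += (w - z) ** 2      # suffer the loss for the current iterate
        grad = 2.0 * (w - z)                 # gradient of f_t at w
        eta = t ** -0.5                      # learning rate eta_t = t^{-1/2}
        w = max(-radius, min(radius, w - eta * grad))  # gradient step + projection

    # Best fixed comparator for squared loss: the mean, clipped to the feasible set.
    best = max(-radius, min(radius, sum(targets) / len(targets)))
    best_loss = sum((best - z) ** 2 for z in targets)
    return (cumulative_loss - best_loss) / len(targets)


if __name__ == "__main__":
    for horizon in (100, 1000, 10000):
        zs = [0.5 if t % 3 == 0 else -0.25 for t in range(horizon)]
        print(horizon, ogd_average_regret(zs))
```

For convex losses with bounded gradients on a bounded domain, this step-size schedule gives total regret growing on the order of $\sqrt{T}$, so the average regret shrinks toward zero, which matches the form of the limit stated in the theorem.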

Figures (11)

  • Figure 1: A sentiment analysis task over a stream of IMDB movie reviews. We use the cheapest logistic regression model (green lines) to process simpler queries and defer more complex queries to the larger models (orange & red lines). When the cascade proceeds to the LLM, the annotations are collected to update the smaller models online (blue lines).
  • Figure 2: The proposed online cascade learning framework, where smaller models with monotonically increasing capacities and costs ($c_1 < c_2 < \dots < c_N$) progressively learn from the ongoing outputs of an LLM (denoted by red arrows). Meanwhile, the deferral policy and corresponding confidence scores are also calibrated online (green arrows).
  • Figure 3: Accuracy curve (and Recall curve for HateSpeech) with respect to costs, using GPT-3.5 Turbo as the LLM in a cascade that also comprises logistic regression and BERT-base.
  • Figure 4: Accuracy curve (and Recall curve for HateSpeech) with respect to costs, using Llama 2 70B Chat as the LLM in a cascade that also comprises logistic regression and BERT-base.
  • Figure 5: Inference results on IMDB when $\mathcal{N}=3671$. The online cascade learning system performs similarly to GPT-3.5 Turbo while saving $\sim$70% of the inference costs.
  • ...and 6 more figures

Theorems & Definitions (9)

  • Theorem 3.1
  • Theorem 3.2
  • Definition 1.1
  • Theorem 1.1
  • proof
  • Lemma 1.2
  • proof
  • Theorem 1.2
  • proof