
MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

Joris Köster, Zixuan Liu, Siavash Khajavi, Zizhan Zheng

Abstract

Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that enables a lightweight model to reuse previously generated answers and retrieve relevant supporting information for cheap inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings by supporting answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining high answer quality comparable to the strong model baseline.

Paper Structure

This paper contains 17 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of MemBoost. For each incoming query, the AME retrieves a small set of relevant memory entries. The MC then either composes an answer from the retrieved results or escalates to the Oracle. When the Oracle is used, the MC decides whether to write the new answer back into the AME for future reuse.
  • Figure 2: Average memory-use rate $\overline{I}_t$ (200-step window) over a 5,000-step Zipf-sampled query stream. Higher $\overline{I}_t$ indicates more queries served from AME and fewer oracle calls, implying lower total inference cost.
  • Figure 3: Response latency over time under Zipf-sampled workloads (average over the previous 100 steps). MemBoost reduces latency relative to the oracle-only baseline as an increasing fraction of queries are served from AME.
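The control flow in Figure 1 can be illustrated with a minimal sketch. This is not the paper's implementation: the class names follow the figure (AME, Oracle), but the Jaccard-based similarity, the reuse threshold, and the unconditional write-back policy are illustrative assumptions standing in for the real retrieval and routing components.

```python
from dataclasses import dataclass, field


def similarity(a: str, b: str) -> float:
    # Toy stand-in for embedding-based retrieval: Jaccard overlap of word sets.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)


@dataclass
class AME:
    """Answer-memory store holding (query, answer) pairs for reuse."""
    entries: list = field(default_factory=list)

    def retrieve(self, query: str, k: int = 3) -> list:
        # Return the k memory entries most similar to the query.
        ranked = sorted(self.entries,
                        key=lambda e: similarity(query, e[0]),
                        reverse=True)
        return ranked[:k]

    def write(self, query: str, answer: str) -> None:
        self.entries.append((query, answer))


class MemBoost:
    """Cost-aware router: serve from memory when possible, else escalate."""

    def __init__(self, oracle, reuse_threshold: float = 0.6):
        self.ame = AME()
        self.oracle = oracle                  # strong, expensive model
        self.reuse_threshold = reuse_threshold
        self.oracle_calls = 0

    def answer(self, query: str) -> str:
        hits = self.ame.retrieve(query)
        if hits and similarity(query, hits[0][0]) >= self.reuse_threshold:
            return hits[0][1]                 # cheap path: reuse a stored answer
        result = self.oracle(query)           # escalate to the strong model
        self.oracle_calls += 1
        self.ame.write(query, result)         # write back for future reuse
        return result
```

Under this sketch, a repeated query is answered from memory on its second occurrence, so the oracle is invoked only once, which is the cost-saving effect the memory-use rate $\overline{I}_t$ in Figure 2 tracks at scale.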