A Generative Caching System for Large Language Models

Arun Iyengar, Ashish Kundu, Ramana Kompella, Sai Nandan Mamidi

TL;DR

The paper tackles the high latency and monetary cost of accessing large language models by introducing GenerativeCache, a caching system that supports generative caching and an enhanced client for coordinating multiple LLMs. It combines semantic caching with adaptive similarity thresholds and a two-threshold mechanism ($t_{single}$, $t_{combined}$) to generate answers from multiple cached responses, potentially caching the generated output for future use. The architecture is hierarchical and modular, enabling flexible data-store options (e.g., Redis Stack, Milvus) and both interactive and automatic operation modes, along with parallel and asynchronous LLM querying. Experimental results on SQuAD show embedding computation as the main overhead and indicate GenerativeCache achieves approximately 9x higher throughput than GPTCache, underscoring significant improvements in latency and cost efficiency for multi-LLM deployment scenarios.
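To make the two-threshold idea concrete, here is a minimal sketch of a lookup that returns either a plain semantic hit, a generated answer built from several cached responses, or a miss. The summary above does not state the exact decision rule, so the rule below is an assumption: a cached entry is a candidate if its similarity is at least `t_single`, and an answer is generated only when at least two candidates together sum to at least `t_combined`. The `t_hit` threshold, the `Entry` class, the `lookup` function, and the string concatenation standing in for response synthesis are all hypothetical names and simplifications, not the paper's API.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    """One cached query-response pair with a precomputed embedding (hypothetical type)."""
    query: str
    vector: tuple[float, ...]
    response: str


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(query_vec, entries, t_hit=0.95, t_single=0.6, t_combined=1.3):
    """Return (kind, answer) where kind is 'hit', 'generated', or 'miss'.

    Assumed rule (not taken from the paper):
      1. best similarity >= t_hit        -> ordinary semantic cache hit
      2. >= 2 entries with sim >= t_single whose similarities sum to
         >= t_combined                   -> combine their responses
      3. otherwise                       -> miss (caller would query an LLM)
    """
    scored = sorted(
        ((cosine(query_vec, e.vector), e) for e in entries),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if scored and scored[0][0] >= t_hit:
        return ("hit", scored[0][1].response)

    candidates = [(s, e) for s, e in scored if s >= t_single]
    if len(candidates) >= 2 and sum(s for s, _ in candidates) >= t_combined:
        # Placeholder synthesis: a real system would merge the responses
        # more intelligently and could cache the generated result.
        return ("generated", " ".join(e.response for _, e in candidates))

    return ("miss", None)


cache = [
    Entry("What is Redis?", (1.0, 0.0), "Redis is an in-memory data store."),
    Entry("What is Milvus?", (0.0, 1.0), "Milvus is a vector database."),
]

print(lookup((1.0, 0.0), cache))    # identical embedding: plain hit
print(lookup((1.0, 1.0), cache))    # between both entries: generated answer
print(lookup((-1.0, -1.0), cache))  # unrelated query: miss
```

Under this assumed rule, raising `t_single` or `t_combined` trades fewer generated answers (more LLM calls, higher cost) for a lower risk of combining loosely related responses.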

Abstract

Caching has the potential to be of significant benefit for accessing large language models (LLMs) due to their high latencies, which typically range from a few seconds to well over a minute. Furthermore, many LLMs charge money for queries; caching thus has a clear monetary benefit. This paper presents a new caching system for improving user experiences with LLMs. In addition to reducing both latencies and monetary costs for accessing LLMs, our system also provides important features that go beyond the performance benefits typically associated with caches. A key feature we provide is generative caching, wherein multiple cached responses can be synthesized to provide answers to queries that have never been seen before. Our generative caches function as repositories of valuable information that can be mined and analyzed. We also improve upon past semantic caching techniques by tailoring the caching algorithms to optimally balance cost and latency reduction with the quality of responses provided. Performance tests indicate that our caches are considerably faster than GPTCache.

Paper Structure

This paper contains 13 sections, 3 equations, and 7 figures.

Figures (7)

  • Figure 1: A distributed hierarchical generative caching system.
  • Figure 2: Our enhanced client for LLMs, with integrated caching.
  • Figure 3: Key components of our system which can be customized and swapped out for other components.
  • Figure 4: Average time in milliseconds to add query-result pairs to a cache.
  • Figure 5: Average time in milliseconds to look up query-result pairs in a cache.
  • ...and 2 more figures