GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching

Sajal Regmi, Chetan Phakami Pun

TL;DR

GPT Semantic Cache tackles the cost and latency bottlenecks of frequent LLM API calls by caching query embeddings in Redis and using an approximate nearest neighbor (ANN) index to retrieve stored responses for semantically similar questions. The approach supports OpenAI embeddings or local ONNX models, uses a cosine similarity threshold of $0.8$ to decide cache reuse, and relies on HNSW for approximately $O(\log n)$ search complexity. In experiments on $8{,}000$ Q–A pairs and $2{,}000$ test queries across four categories, API calls were reduced by up to $68.8\%$, with cache hit rates from $61.6\%$ to $68.8\%$ and positive hit rates above $97\%$, alongside substantially lower response times. The method is relevant to customer support, RAG, real-time code assistance, and e-commerce, and the authors point to dynamic thresholding, distributed caching, and domain-specific embeddings as future work.
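
The lookup flow summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain Python list with a linear scan stands in for Redis and the HNSW index, `ToyEmbedder` is a hypothetical bag-of-words substitute for an OpenAI or ONNX embedding model, and treating a similarity exactly equal to $0.8$ as a hit is an assumption. All names (`SemanticCache`, `answer`, `call_llm`) are invented for this example.

```python
import math

SIMILARITY_THRESHOLD = 0.8  # cosine cutoff reported in the summary


class ToyEmbedder:
    """Hypothetical stand-in for a real embedding model (assumption).

    Each new token gets the next free slot in a fixed-size count vector;
    slots wrap around once more than `dim` distinct tokens are seen.
    """

    def __init__(self, dim=64):
        self.dim = dim
        self.vocab = {}

    def __call__(self, text):
        vec = [0.0] * self.dim
        for token in text.lower().split():
            idx = self.vocab.setdefault(token, len(self.vocab) % self.dim)
            vec[idx] += 1.0
        return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """In-memory cache; a linear scan replaces Redis + HNSW (assumption)."""

    def __init__(self, embed, threshold=SIMILARITY_THRESHOLD):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs

    def lookup(self, query):
        """Return the best cached response, or None below the threshold."""
        query_vec = self.embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query, response):
        self.entries.append((self.embed(query), response))


def answer(query, cache, call_llm):
    """Serve from the cache when possible; otherwise call the LLM and cache."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    response = call_llm(query)
    cache.store(query, response)
    return response, False
```

Passing any callable as `call_llm` (for instance a stub that records its inputs) makes it easy to count how many queries actually reach the model. In a real deployment the linear scan would be replaced by an ANN index so that lookup cost grows roughly logarithmically rather than linearly with cache size.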

Abstract

Large Language Models (LLMs), such as GPT, have revolutionized artificial intelligence by enabling nuanced understanding and generation of human-like text across a wide range of applications. However, the high computational and financial costs associated with frequent API calls to these models present a substantial bottleneck, especially for applications like customer service chatbots that handle repetitive queries. In this paper, we introduce GPT Semantic Cache, a method that leverages semantic caching of query embeddings in in-memory storage (Redis). By storing embeddings of user queries, our approach efficiently identifies semantically similar questions, allowing for the retrieval of pre-generated responses without redundant API calls to the LLM. This technique achieves a notable reduction in operational costs while significantly enhancing response times, making it a robust solution for optimizing LLM-powered applications. Our experiments demonstrate that GPT Semantic Cache reduces API calls by up to 68.8% across various query categories, with cache hit rates ranging from 61.6% to 68.8%. Additionally, the system achieves high accuracy, with positive hit rates exceeding 97%, confirming the reliability of cached responses. This technique not only reduces operational costs, but also improves response times, enhancing the efficiency of LLM-powered applications.
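
To make the link between hit rate, API calls, and response time concrete, the following sketch works through the arithmetic. It assumes that every cache hit avoids exactly one API call, that every query pays the cache lookup, and that only misses pay the LLM round trip. The function name, the workload size, and both latency values are hypothetical illustrations, not measurements from the paper; only the $68.8\%$ hit rate and the $97\%$ positive hit rate floor come from the abstract.

```python
def cache_savings(n_queries, hit_rate, positive_rate,
                  llm_latency_s, cache_latency_s):
    """Estimate the effect of a semantic cache on a workload.

    Assumptions (not from the paper): each hit avoids one API call,
    every query pays the cache lookup, and only misses pay the LLM call.
    """
    hits = n_queries * hit_rate
    return {
        "api_calls": n_queries - hits,          # queries that still reach the LLM
        "correct_hits": hits * positive_rate,   # hits judged to be valid answers
        "mean_latency_s": cache_latency_s + (1 - hit_rate) * llm_latency_s,
    }


# Hypothetical workload: 1000 queries, 2.0 s per LLM call, 0.05 s per lookup.
example = cache_savings(
    n_queries=1000,
    hit_rate=0.688,        # upper end of the reported hit-rate range
    positive_rate=0.97,    # reported lower bound on positive hit rate
    llm_latency_s=2.0,     # assumed, for illustration only
    cache_latency_s=0.05,  # assumed, for illustration only
)
```

Under these assumptions about 312 of the 1000 queries still reach the LLM, which is why the reported reduction in API calls matches the highest cache hit rate.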

Paper Structure

This paper contains 33 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: The diagram illustrates the core components of the GPT Semantic Cache system, showcasing the flow of user queries through embedding generation, similarity calculation [rahutomo2012semantic], and Approximate Nearest Neighbors (ANN) indexing. The Redis-based in-memory cache stores embeddings and corresponding responses, facilitating quick retrieval. Queries are sent to the GPT API only when no matching response is found in the cache.
  • Figure 2: Comparison of API Call Frequency: Traditional Query Handling vs Semantic Caching System.
  • Figure 3: Comparison of Average Query Response Times: With Cache vs Without Cache.
  • Figure 4: Cache Hit Rates and Positive Match Accuracy Across Query Categories.