GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching
Sajal Regmi, Chetan Phakami Pun
TL;DR
GPT Semantic Cache tackles the cost and latency bottlenecks of frequent LLM API calls by caching query embeddings in Redis and serving stored responses for semantically similar questions found through an approximate nearest neighbor (ANN) index. The approach supports OpenAI embeddings or local ONNX models and uses a cosine similarity threshold of $0.8$ to decide cache reuse, with approximately $O(\log n)$ search complexity via HNSW. In experiments on $8{,}000$ Q–A pairs and $2{,}000$ test queries across four categories, cache hit rates ranged from $61.6\%$ to $68.8\%$, reducing API calls by up to $68.8\%$, with positive hit rates above $97\%$ and substantial latency improvements. The method is relevant to customer support, RAG, real-time code assistance, and e-commerce, and the authors point to dynamic thresholding, distributed caching, and domain-specific embeddings as future work.
Abstract
Large Language Models (LLMs), such as GPT, have revolutionized artificial intelligence by enabling nuanced understanding and generation of human-like text across a wide range of applications. However, the high computational and financial costs of frequent API calls to these models present a substantial bottleneck, especially for applications such as customer service chatbots that handle repetitive queries. In this paper, we introduce GPT Semantic Cache, a method that caches query embeddings in in-memory storage (Redis). By storing embeddings of user queries, our approach efficiently identifies semantically similar questions and retrieves pre-generated responses without redundant API calls to the LLM. Our experiments demonstrate that GPT Semantic Cache reduces API calls by up to 68.8% across various query categories, with cache hit rates ranging from 61.6% to 68.8%. The system also achieves high accuracy, with positive hit rates exceeding 97%, confirming the reliability of cached responses. This technique therefore reduces operational costs while improving response times, making it a robust solution for optimizing LLM-powered applications.
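The lookup flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain Python list in place of Redis, a brute-force linear scan in place of an HNSW index, and a hypothetical `toy_embed` word-count function in place of OpenAI or ONNX embeddings. The names `SemanticCache`, `answer`, and `call_llm` are likewise illustrative. Only the $0.8$ cosine similarity threshold is taken from the text.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]

# Hypothetical stand-in for a real embedding model (OpenAI or ONNX in the paper):
# counts occurrences of a few fixed vocabulary words.
VOCAB = ["reset", "password", "change", "track", "order", "refund"]


def toy_embed(text: str) -> Vector:
    tokens = text.lower().replace("?", "").split()
    return [float(tokens.count(word)) for word in VOCAB]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as matching nothing
    return dot / (norm_a * norm_b)


class SemanticCache:
    """In-memory cache keyed by query embeddings (assumption: list, not Redis)."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.8):
        self.embed = embed
        self.threshold = threshold  # cosine similarity cutoff from the text
        self.entries: List[Tuple[Vector, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        # Brute-force O(n) scan; an ANN index such as HNSW would replace this.
        query_vec = self.embed(query)
        best_score, best_response = -1.0, None
        for vec, response in self.entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))


def answer(query: str, cache: SemanticCache,
           call_llm: Callable[[str], str]) -> Tuple[str, bool]:
    """Return (response, was_cache_hit); call the LLM only on a miss."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    response = call_llm(query)
    cache.store(query, response)
    return response, False
```

In this sketch, a repeated or closely paraphrased query is answered from the cache and `call_llm` is never invoked, which is the mechanism behind the reported reduction in API calls. A lower threshold would raise the hit rate at the risk of returning responses for questions that are not actually equivalent.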
