SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models

Jiaxing Li; Chi Xu; Feng Wang; Isaac M von Riedemann; Cong Zhang; Jiangchuan Liu

TL;DR

SCALM addresses cache inefficiency in LLMChat services by exploiting semantic patterns in real-world dialogues. It introduces two hierarchical semantic clustering methods, CO-HSC and SE-HSC, and a token-saving metric that ranks cache entries by the cost they save rather than by hit rate alone. In a prototype integrated with GPTCache and evaluated on the LMSYS and MOSS datasets, SCALM achieves, on average, a relative increase of 63% in cache hit ratio and a relative improvement of 77% in token savings over GPTCache's existing solutions, which lowers operational costs for LLM-based chat services on real-world workloads.

Abstract

Large Language Models (LLMs) have become increasingly popular, transforming a wide range of applications across various domains. However, the real-world effectiveness of their query cache systems has not been thoroughly investigated. In this work, we present the first analysis of real-world human-to-LLM interaction data and identify key challenges in existing caching solutions for LLM-based chat services. Our findings reveal that current caching methods fail to leverage semantic connections, leading to inefficient cache performance and extra token costs. To address these issues, we propose SCALM, a new cache architecture that emphasizes semantic analysis and identifies significant cache entries and patterns. We also detail the implementations of the corresponding cache storage and eviction strategies. Our evaluations show that SCALM increases cache hit ratios and reduces operational costs for LLMChat services. Compared with other state-of-the-art solutions in GPTCache, SCALM shows, on average, a relative increase of 63% in cache hit ratio and a relative improvement of 77% in token savings.
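
To make the two reported metrics concrete, the following is a minimal sketch of a similarity-based query cache that tracks a hit ratio and a token saving ratio. It is not SCALM's implementation and does not use the GPTCache API: the bag-of-words `embed` function, the whitespace token count, the 0.8 threshold, and the exact metric formulas are all assumptions chosen for illustration, and the class and function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def count_tokens(text):
    """Assumption: whitespace-separated words approximate billed tokens."""
    return len(text.split())


def relative_increase(new, baseline):
    """Relative change of `new` over `baseline`, e.g. 0.63 means 63% higher."""
    return (new - baseline) / baseline


class SemanticCache:
    """Hypothetical similarity-threshold cache; no clustering or eviction."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []      # list of (query embedding, cached response)
        self.lookups = 0
        self.hits = 0
        self.tokens_total = 0  # tokens that would be spent with no cache
        self.tokens_saved = 0  # tokens avoided because of cache hits

    def query(self, prompt, llm):
        self.lookups += 1
        q = embed(prompt)
        # Find the most similar cached query, if any.
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            response = best[1]
            self.hits += 1
            self.tokens_saved += count_tokens(prompt) + count_tokens(response)
        else:
            response = llm(prompt)  # cache miss: call the model and store the result
            self.entries.append((q, response))
        self.tokens_total += count_tokens(prompt) + count_tokens(response)
        return response

    def hit_ratio(self):
        return self.hits / self.lookups if self.lookups else 0.0

    def token_saving_ratio(self):
        return self.tokens_saved / self.tokens_total if self.tokens_total else 0.0
```

A caller would pass any function that maps a prompt string to a response string as `llm`, issue queries through `query`, and then read `hit_ratio()` and `token_saving_ratio()`. The two ratios can diverge: a cache that hits mostly on short exchanges may have a high hit ratio but save few tokens, which is the gap a token-oriented ranking is meant to address.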

Paper Structure (17 sections, 4 equations, 8 figures, 3 tables, 2 algorithms)

Introduction
Related Works
Motivation and Observation
Preliminary
Data-Driven Analysis
Challenges and Opportunities
Semantics-Oriented Cache
Design Overview
Hierarchical Semantic Clustering
Prototype Implementation and Evaluation
Method and Parameter Settings
Adaptive Storage Strategy
Adaptive Eviction Strategy
Metrics
Performance Evaluation
...and 2 more sections

Figures (8)

Figure 1: General caching workflow for LLMChat services.
Figure 2: Round distribution of the MOSS and LMSYS datasets.
Figure 3: Hit ratios of different categories in MOSS dataset.
Figure 4: An overview of the SCALM architecture.
Figure 5: Comparative analysis of hit ratios and token saving ratios with different conversation scales.
...and 3 more figures
