Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data

Waris Gill, Justin Cechmanek, Tyler Hutcherson, Srijith Rajamohan, Jen Agarwal, Muhammad Ali Gulzar, Manvinder Singh, Benoit Dion

TL;DR

This work addresses semantic caching for LLM-based services by favoring compact, domain-specific embeddings over large models. It introduces LangCache-Embed, a ModernBERT-based embedding tuned with online contrastive learning on domain data, complemented by a synthetic data pipeline to enable domain adaptation with limited labeled data. Empirical results on Quora and medical datasets show state-of-the-art performance, with notable gains from domain fine-tuning and additional improvements from synthetic data, while careful tuning avoids catastrophic forgetting and maintains cross-domain generalization. The findings demonstrate a practical, efficient path to high-precision semantic caching, balancing latency and accuracy for real-world deployment.

Abstract

This report investigates enhancing semantic caching effectiveness by employing specialized, fine-tuned embedding models. Semantic caching relies on embedding similarity rather than exact key matching, presenting unique challenges in balancing precision, query latency, and computational efficiency. We propose leveraging smaller, domain-specific embedding models, fine-tuned with targeted real-world and synthetically generated datasets. Our empirical evaluations demonstrate that compact embedding models fine-tuned for just one epoch on specialized datasets significantly surpass both state-of-the-art open-source and proprietary alternatives in precision and recall. Moreover, we introduce a novel synthetic data generation pipeline for the semantic cache that mitigates the challenge of limited domain-specific annotated data, further boosting embedding performance. Our approach effectively balances computational overhead and accuracy, establishing a viable and efficient strategy for practical semantic caching implementations.
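To make the contrast between embedding-similarity lookup and exact key matching concrete, here is a minimal sketch of a semantic cache. It is an illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model such as LangCache-Embed, and the `SemanticCache` class, its method names, and the 0.8 threshold are all invented for this example.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Toy cache keyed by embedding similarity instead of exact string match."""

    def __init__(self, embed=toy_embed, threshold=0.8):
        self.embed = embed          # assumed: any callable text -> vector
        self.threshold = threshold  # assumed value; tuned per domain in practice
        self.entries = []           # list of (embedding, response) pairs

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

    def get(self, query):
        """Return the response of the most similar cached query, or None on a miss."""
        if not self.entries:
            return None
        q = self.embed(query)
        score, response = max(
            ((cosine(q, e), r) for e, r in self.entries), key=lambda p: p[0]
        )
        return response if score >= self.threshold else None


cache = SemanticCache(threshold=0.8)
cache.put("how do i reset my password", "Use the 'Forgot password' link.")
hit = cache.get("how do i reset my password please")   # near-duplicate wording
miss = cache.get("what is the capital of france")      # unrelated query
```

An exact-match cache would miss the reworded query, while the similarity lookup returns the stored response. The threshold is where the precision trade-off shows up: set too low, the cache serves answers to queries that only look similar; set too high, it rarely hits. A linear scan is used here for brevity; a production system would typically use a vector index.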

Paper Structure

This paper contains 9 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Comparison of embedding-model performance on the Quora dataset. The y-axis shows score, while the x-axis lists metrics. LangCache-Embed (i.e., fine-tuned ModernBERT) exhibits a significant uplift in precision and recall compared to its baseline (non-fine-tuned) version and other state-of-the-art embedding models, highlighting the impact of fine-tuning.
  • Figure 2: Evaluation of different embedding models on a specialized medical dataset. Notably, LangCache-Embed outperforms both large-scale open-source and closed-source baselines, demonstrating that lightweight models adapted to domain-specific data can achieve state-of-the-art results.
  • Figure 3: The plots compare performance on the target (fine-tuning) dataset versus performance on an unseen or previously learned dataset. a) Overly extensive fine-tuning on a single domain degrades the model’s generalization to out-of-domain queries, as shown by reduced precision. b) Limiting fine-tuning (e.g., to a single epoch and a moderate gradient norm) mitigates catastrophic forgetting and preserves strong cross-domain performance.
  • Figure 4: This plot illustrates the trade-off between embedding generation overhead (x-axis, measured in seconds) and average precision on the Quora test set (y-axis). Each point represents a different embedding model, including both open-source and commercial offerings. Models in the upper-left region deliver high precision at low embedding time. LangCache-Embed (fine-tuned ModernBERT) stands out for combining rapid inference with top-tier performance, making it well suited to real-time semantic caching where both speed and accuracy are critical.
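
The figures above report precision and recall for each embedding model. As a rough sketch of what those metrics mean for a cache, the snippet below scores hit/miss decisions at a fixed similarity threshold. It is not the paper's evaluation protocol (Figure 4, for instance, reports average precision, which this sketch does not compute); the function name `precision_recall`, the threshold, and the scores and labels are made up for illustration.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of cache-hit decisions at a similarity threshold.

    scores: similarity score for each (new query, cached query) pair.
    labels: 1 if the pair is a true duplicate (serving the cached answer
            would be correct), else 0.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up similarity scores and duplicate labels, for illustration only.
scores = [0.95, 0.90, 0.85, 0.60, 0.40]
labels = [1, 1, 0, 1, 0]

p, r = precision_recall(scores, labels, threshold=0.8)
p_strict, r_strict = precision_recall(scores, labels, threshold=0.9)
```

In this toy data, raising the threshold from 0.8 to 0.9 removes the one false hit without losing any true hit, so precision rises while recall stays the same. For a cache, a false hit means returning a wrong answer, whereas a false miss only costs an extra LLM call, which is one reason precision tends to be the metric to protect.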