Clustered Retrieved Augmented Generation (CRAG)
Simon Akesson, Frances A. Santos
TL;DR
CRAG tackles the token and context-window limitations of Retrieval Augmented Generation (RAG) for QA over large review corpora with a three-stage pipeline: clustering the knowledge with $k$-means, summarizing each cluster with a language model, and aggregating the cluster summaries into a single, condensed prompt. The approach reduces prompt tokens by at least $46\%$, and by more than $90\%$ in some cases, while preserving answer quality, as evidenced by cosine similarity in the $0.7$–$0.9$ range across multiple LLMs (GPT-4, Llama2-70B, Mixtral8x7B). Experiments compare CRAG to a RAG baseline built on ElasticSearch and show substantial cost savings (e.g., in a case with 75 reviews, tokens drop from $5{,}165$ to $468$) as well as consistent semantic coverage of topics. The work demonstrates practical benefits for scalable, low-latency QA systems and points to future improvements via alternative clustering methods, larger open-source LLMs, and fine-tuning or few-shot prompting for summarization.
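The three stages can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The bag-of-words `embed` function, the deterministic centroid initialization, and the `summarize` stand-in (which simply keeps the shortest review instead of calling a language model) are all assumptions made so the example runs on its own. The function names `cluster_summaries` and `build_prompt` are hypothetical.

```python
from collections import Counter


def embed(text, vocab):
    """Toy bag-of-words vector (assumption: a real system would use a sentence encoder)."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Plain Lloyd's k-means; centroids start at the first k vectors (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def summarize(texts):
    """Hypothetical stand-in for the LLM summarization call: keep the shortest text."""
    return min(texts, key=len)


def cluster_summaries(reviews, k):
    """Stages 1 and 2: cluster the reviews, then summarize each non-empty cluster."""
    k = min(k, len(reviews))
    vocab = sorted({w for r in reviews for w in r.lower().split()})
    labels = kmeans([embed(r, vocab) for r in reviews], k)
    summaries = []
    for c in range(k):
        members = [r for r, lab in zip(reviews, labels) if lab == c]
        if members:
            summaries.append(summarize(members))
    return summaries


def build_prompt(summaries, question):
    """Stage 3: aggregate the cluster summaries into one condensed prompt."""
    context = "\n".join(f"- {s}" for s in summaries)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Calling `build_prompt(cluster_summaries(reviews, k=2), question)` produces a prompt whose context holds at most two summaries, however many reviews were supplied. That bound is what keeps the prompt size roughly constant as the corpus grows.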
Abstract
Providing external knowledge to Large Language Models (LLMs) is key to using these models in real-world applications for several reasons, such as incorporating up-to-date content in real time, providing access to domain-specific knowledge, and helping to prevent hallucinations. The vector database-based Retrieval Augmented Generation (RAG) approach has been widely adopted to this end: any part of the external knowledge can be retrieved and provided to an LLM as input context. Despite the success of the RAG approach, it can still be infeasible for some applications, because the retrieved context can demand a longer context window than the LLM supports. Even when the retrieved context fits into the context window, the number of tokens can be substantial and, consequently, increase costs and processing time, making the approach impractical for most applications. To address these issues, we propose CRAG, a novel approach that effectively reduces the number of prompt tokens without degrading the quality of the generated response compared to a solution using RAG. Through our experiments, we show that CRAG can reduce the number of tokens by at least 46\%, and by more than 90\% in some cases, compared to RAG. Moreover, the number of tokens with CRAG does not increase considerably as the number of reviews analyzed grows, unlike RAG, where the number of tokens is almost 9x higher with 75 reviews than with 4 reviews.
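The two quantities behind these claims, relative token reduction and cosine similarity between response embeddings, are simple to compute. The sketch below is illustrative only: the function names are hypothetical, and the way responses are embedded before comparison is not specified here, so `cosine_similarity` just takes two plain numeric vectors. The token counts in the example are the 75-review figures quoted in the TL;DR.

```python
import math


def token_reduction(rag_tokens, crag_tokens):
    """Fraction of prompt tokens saved relative to the RAG baseline."""
    return 1.0 - crag_tokens / rag_tokens


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# 75-review case from the TL;DR: 5,165 tokens with RAG vs. 468 with CRAG.
print(f"{token_reduction(5165, 468):.1%}")  # prints 90.9%
```

The result is consistent with the reported reduction of more than 90\% in some cases.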
