Accelerating Retrieval-Augmented Generation

Derrick Quinn, Mohammad Nouri, Neel Patel, John Salihu, Alireza Salemi, Sukhan Lee, Hamed Zamani, Mohammad Alian

TL;DR

This work tackles the retrieval bottleneck in Retrieval-Augmented Generation (RAG) by comparing exact and approximate nearest-neighbor search and showing that exact nearest-neighbor search (ENNS), when accelerated, yields substantial end-to-end gains. It introduces Intelligent Knowledge Store (IKS), a near-memory, CXL-based memory expander that offloads ENNS to an array of near-memory accelerators (NMAs), achieving 13.4–27.9× ENNS speedups and 1.7–26.3× end-to-end RAG speedups on representative tasks. IKS uses a cache-coherent interface (CXL.cache) and a scale-out architecture to maximize memory bandwidth efficiency while keeping embedding data close to memory. The results show that near-memory ENNS can outperform GPU-based retrieval for large vector stores, offering higher capacity, comparable bandwidth, and favorable power and cost characteristics, which makes near-memory acceleration a practical path for scalable, high-recall RAG systems.

Abstract

An evolving solution to address hallucination and enhance accuracy in large language models (LLMs) is Retrieval-Augmented Generation (RAG), which involves augmenting LLMs with information retrieved from an external knowledge source, such as the web. This paper profiles several RAG execution pipelines and demystifies the complex interplay between their retrieval and generation phases. We demonstrate that while exact retrieval schemes are expensive, they can reduce inference time compared to approximate retrieval variants because an exact retrieval model can send a smaller but more accurate list of documents to the generative model while maintaining the same end-to-end accuracy. This observation motivates the acceleration of the exact nearest neighbor search for RAG. In this work, we design Intelligent Knowledge Store (IKS), a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators. IKS offers 13.4-27.9x faster exact nearest neighbor search over a 512GB vector database compared with executing the search on Intel Sapphire Rapids CPUs. This higher search performance translates to 1.7-26.3x lower end-to-end inference time for representative RAG applications. IKS is inherently a memory expander; its internal DRAM can be disaggregated and used for other applications running on the server to prevent DRAM, which is the most expensive component in today's servers, from being stranded.
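
To make concrete what the exact nearest neighbor search discussed above computes, here is a minimal sketch in plain Python. It assumes inner-product similarity and small in-memory lists of floats; the function names `dot` and `exact_top_k` are hypothetical and are not taken from the paper. The point is only that an exact search scores every vector in the corpus, which is why its cost grows with corpus size and is dominated by reading the embeddings from memory.

```python
import heapq
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(
    query: Sequence[float],
    corpus: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[float, int]]:
    """Return (score, index) pairs for the k best-scoring corpus vectors.

    Every corpus vector is scored, so the result is exact: no index
    structure or pruning is used (illustrative assumption: similarity is
    the inner product).
    """
    scored = ((dot(query, vec), idx) for idx, vec in enumerate(corpus))
    # nlargest keeps only k candidates at a time and returns them
    # sorted from highest to lowest score.
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    corpus = [[1, 0], [0, 1], [1, 1], [-1, 0]]
    for score, idx in exact_top_k([1, 2], corpus, k=2):
        print(idx, score)
```

In a RAG pipeline, the returned indices would select the K documents passed to the generative model. An approximate search would skip part of the scan at the risk of missing some of these top-scoring vectors.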

Paper Structure

This paper contains 31 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Overview of the Retrieval-Augmented Generation (RAG) pipeline.
  • Figure 2: Generation accuracy vs. throughput (Queries/sec) of representative RAG applications for various retrieval algorithms and document counts (K). The corpus size is set to 50 GB and batch size to 16.
  • Figure 3: Latency breakdown of FiDT5, Llama-8B, and Llama-70B for various values of K and corpus sizes. All configurations use batch size 1. Retrieval is ENNS and runs on CPU; generation runs on a single NVIDIA H100 (SXM) for all generative models. The value in each bar shows the absolute retrieval time.
  • Figure 4: Roofline model for ENNS using batch sizes 1 and 16. See the methodology section for the experimental setup.
  • Figure 5: (a) IKS internal DRAM, scratchpad spaces, and configuration registers are mapped to the host address space. The scratchpad and configuration register address ranges are labeled as Context Buffers (CB). (b) IKS is a compute-enabled CXL memory expander that includes eight LPDDR5X packages with one near-memory accelerator (NMA) chip near each package. (c) Each NMA includes 64 processing engines. (d) Dot-product units reuse the query vector (QV) dimension across 68 MAC units.
  • ...and 6 more figures