REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing

Kangqi Chen, Andreas Kosmas Kakolyris, Rakesh Nadig, Manos Frouzakis, Nika Mansouri Ghiasi, Yu Liang, Haiyu Mao, Jisung Park, Mohammad Sadrosadati, Onur Mutlu

TL;DR

This paper addresses the bottleneck in Retrieval-Augmented Generation (RAG) caused by I/O data movement during the retrieval stage. It introduces REIS, an In-Storage Processing (ISP) system tailored for RAG that links embeddings to documents, uses an IVF-friendly ISP workflow, and employs an in-storage ANNS engine to perform distance computations within the storage device. Compared to CPU baselines, the approach improves retrieval performance by $\approx 13\times$ and energy efficiency by $\approx 55\times$ on average, and it also shows strong gains over prior ISP accelerators and VN-based ANNS solutions. By minimizing data movement and leveraging storage-internal parallelism, REIS demonstrates a practical, hardware-friendly path to accelerating RAG pipelines at scale, with notable end-to-end latency reductions and energy savings.

Abstract

Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on. To overcome this issue, Retrieval-Augmented Generation (RAG) complements the static training-derived knowledge of LLMs with an external knowledge repository. RAG consists of three stages: indexing, retrieval, and generation. The retrieval stage of RAG becomes a significant bottleneck in inference pipelines. In this stage, a user query is mapped to an embedding vector and an Approximate Nearest Neighbor Search (ANNS) algorithm searches for similar vectors in the database to identify relevant items. Due to the large database sizes, ANNS incurs significant data movement overheads between the host and the storage system. To alleviate these overheads, prior works propose In-Storage Processing (ISP) techniques that accelerate ANNS by performing computations inside storage. However, existing works that leverage ISP for ANNS (i) employ algorithms that are not tailored to ISP systems, (ii) do not accelerate data retrieval operations for data selected by ANNS, and (iii) introduce significant hardware modifications, limiting performance and hindering their adoption. We propose REIS, the first ISP system tailored for RAG that addresses these limitations with three key mechanisms. First, REIS employs a database layout that links database embedding vectors to their associated documents, enabling efficient retrieval. Second, it enables efficient ANNS by introducing an ISP-tailored data placement technique that distributes embeddings across the planes of the storage system and employs a lightweight Flash Translation Layer. Third, REIS leverages an ANNS engine that uses the existing computational resources inside the storage system. Compared to a server-grade system, REIS improves the performance (energy efficiency) of retrieval by an average of 13x (55x).
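The retrieval flow described in the abstract (cluster-based ANNS followed by fetching the documents linked to the selected embeddings) can be illustrated with a minimal host-side sketch. This is not the REIS implementation: it runs in plain Python on the CPU, whereas REIS performs distance computations inside the storage device. All names (`sq_dist`, `build_ivf`, `ivf_search`, `nprobe`, `k`) and the toy data are assumptions made for illustration, and a shared list index stands in for the embedding-to-document link.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(embeddings, centroids):
    """Build one inverted list per centroid, holding the indices of its nearest embeddings."""
    lists = [[] for _ in centroids]
    for idx, vec in enumerate(embeddings):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists


def ivf_search(query, embeddings, documents, centroids, lists, nprobe=1, k=2):
    """Probe the nprobe closest clusters, then return the k nearest (index, document) pairs."""
    # Coarse step: rank clusters by centroid distance and keep the closest nprobe.
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    # Fine step: compute distances only for embeddings in the probed clusters.
    candidates = [i for c in probe for i in lists[c]]
    top = sorted(candidates, key=lambda i: sq_dist(query, embeddings[i]))[:k]
    # embeddings[i] and documents[i] share an index (hypothetical stand-in for the link).
    return [(i, documents[i]) for i in top]


# Toy database: four 2-D embeddings, their documents, and two fixed centroids.
embeddings = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
documents = ["doc A", "doc B", "doc C", "doc D"]
centroids = [[0.0, 0.0], [5.0, 5.0]]

lists = build_ivf(embeddings, centroids)
results = ivf_search([4.9, 5.0], embeddings, documents, centroids, lists, nprobe=1, k=2)
print(results)
```

In this sketch only the embeddings in the probed cluster are compared against the query, which is the property that makes cluster-based search a natural fit for limiting how much data must be read and moved.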

Paper Structure

This paper contains 40 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: NAND Flash Memory Architecture
  • Figure 2: Latency breakdown for a typical RAG pipeline. Total time is displayed next to each bar.
  • Figure 3: Latency breakdown for a RAG pipeline using Binary Quantization (BQ). Total time is displayed next to each bar.
  • Figure 4: Overview of REIS.
  • Figure 5: Comparison of ANNS algorithms in terms of throughput and recall running on CPU. For IVF, nlist denotes the number of clusters for a dataset. For HNSW, M denotes the number of neighbors for each vertex.
  • ...and 6 more figures