DIRC-RAG: Accelerating Edge RAG with Robust High-Density and High-Loading-Bandwidth Digital In-ReRAM Computation

Kunming Shao, Zhipeng Liao, Jiangnan Yu, Liang Zhao, Qiwei Li, Xijie Huang, Jingyu He, Fengshi Tian, Yi Zou, Xiaomeng Wang, Tim Kwang-Ting Cheng, Chi-Ying Tsui

TL;DR

DIRC-RAG is an accelerator for edge Retrieval-Augmented Generation built on digital In-ReRAM computation (DIRC): it couples high-density ReRAM with SRAM in a single computing-in-memory (CIM) cell and adopts a query-stationary dataflow to minimize data movement. The architecture supports 16 parallel cores, error-aware optimization (bit-wise remapping and an error-detection circuit), and INT4/INT8 quantization to balance accuracy, density, and throughput. Hardware and software evaluations show record on-chip density (~5.18 Mb/mm^2), high throughput (~131 TOPS), a retrieval latency of 5.6 μs per query over 4 MB of embeddings, and energy of about 0.956 μJ per query, with competitive retrieval precision. These results indicate substantial improvements for energy-efficient edge RAG and point to chiplet integration or conventional CIM as practical ways to scale when on-chip storage is insufficient.

Abstract

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieval, but it faces challenges on edge devices due to high storage, energy, and latency demands. Computing-in-Memory (CIM) offers a promising solution by storing document embeddings in CIM macros and enabling in-situ parallel retrieval, but it is constrained by either low memory density or limited computational accuracy. To address these challenges, we present DIRC-RAG, a novel edge RAG acceleration architecture leveraging Digital In-ReRAM Computation (DIRC). DIRC integrates a high-density multi-level ReRAM subarray with an SRAM cell, utilizing SRAM and differential sensing for robust ReRAM readout and digital multiply-accumulate (MAC) operations. By storing all document embeddings within the CIM macro, DIRC achieves ultra-low-power, single-cycle data loading, substantially reducing both energy consumption and latency compared to off-chip DRAM. A query-stationary (QS) dataflow is supported for RAG tasks, minimizing on-chip data movement and reducing SRAM buffer requirements. We introduce error optimization for the DIRC ReRAM-SRAM cell by extracting the bit-wise spatial error distribution of the ReRAM subarray and applying targeted bit-wise data remapping. An error detection circuit is also implemented to enhance readout resilience against device- and circuit-level variations. Simulation results demonstrate that DIRC-RAG, in a TSMC 40 nm process, achieves an on-chip non-volatile memory density of 5.18 Mb/mm^2 and a throughput of 131 TOPS. It delivers a 4 MB retrieval latency of 5.6 μs/query and an energy consumption of 0.956 μJ/query while maintaining retrieval precision.
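To make the retrieval step concrete, the following is a minimal software sketch of query-stationary scoring over INT8-quantized embeddings. It is not the DIRC hardware or the authors' implementation: the function names (`quantize_int8`, `retrieve_top_k`), the symmetric quantization scheme, the shared scale factor, and the toy three-dimensional vectors are all assumptions chosen for illustration. The only ideas taken from the abstract are that embeddings are stored in quantized form, that scoring is an integer multiply-accumulate, and that the query stays fixed while stored embeddings are read out.

```python
def quantize_int8(vec, scale):
    """Symmetric INT8 quantization (an assumed scheme): round(v / scale),
    clipped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in vec]


def retrieve_top_k(query_q, doc_embeddings_q, k):
    """Score every stored embedding against one fixed (stationary) query.

    The quantized query is held in place while each document embedding is
    read and combined with it by an integer multiply-accumulate (MAC).
    Returns the ids of the k highest-scoring documents.
    """
    scores = []
    for doc_id, doc_q in enumerate(doc_embeddings_q):
        acc = 0
        for q, d in zip(query_q, doc_q):
            acc += q * d  # integer MAC
        scores.append((acc, doc_id))
    # Highest score first; ties broken by lower document id.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scores[:k]]


# Toy data: three hypothetical document embeddings and one query.
SCALE = 1 / 127
docs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.1]]
query = [1.0, 0.0, 0.0]

docs_q = [quantize_int8(d, SCALE) for d in docs]
query_q = quantize_int8(query, SCALE)
top2 = retrieve_top_k(query_q, docs_q, k=2)
print(top2)  # [0, 2]
```

In this sketch the loop over documents is sequential; the point of a CIM design is that the stored embeddings are scored in parallel where they reside, so only the query and the resulting scores need to move.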

Paper Structure

This paper contains 15 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Demonstration of the workflow of RAG, where the document embedding storage and loading for retrieval are the bottlenecks.
  • Figure 2: Comparison between mainstream CIM memories.
  • Figure 3: (a) DIRC-RAG architecture for document retrieval with query-stationary dataflow. (b) DIRC macro with 128x128 DIRC cells, digital MAC and error detection circuit. (c) DIRC cell with 8x8 MLC ReRAM subarray, differential sensing circuit and 1bit SRAM cell.
  • Figure 4: Bit-level query-stationary dataflow in a DIRC column, which stores sixteen document embeddings and enables error detection after ReRAM sensing.
  • Figure 5: (a) The LSB spatial error map of the 8x8 ReRAM subarray from post-layout Monte Carlo simulation. (b) The error detection circuit for checking the sensing error of the DIRC cell.
  • ...and 2 more figures
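As a rough sanity check on the quoted figures, the array sizes in the Figure 3 caption can be combined with the core count and per-query numbers from the abstract. The sketch below rests on two assumptions that the text above does not state: that each multi-level ReRAM device stores 2 bits, and that each of the 16 cores holds exactly one DIRC macro. Under those assumptions the total capacity comes to 4 MiB, which is consistent with the 4 MB retrieval figure, but the breakdown itself is unconfirmed. The derived bandwidth and energy-per-bit values further assume that a query scans the full 4 MB.

```python
# Values quoted in the text above.
DIRC_CELLS_PER_MACRO = 128 * 128   # Figure 3(b): 128x128 DIRC cells
RERAM_DEVICES_PER_CELL = 8 * 8     # Figure 3(c): 8x8 MLC ReRAM subarray
NUM_CORES = 16                     # parallel cores
RETRIEVAL_LATENCY_S = 5.6e-6       # 5.6 us per query
ENERGY_PER_QUERY_J = 0.956e-6      # 0.956 uJ per query

# Assumptions for illustration only; NOT stated in the text.
BITS_PER_RERAM_DEVICE = 2          # assumed MLC storage density
MACROS_PER_CORE = 1                # assumed one macro per core


def macro_capacity_bits():
    """Bits stored in one DIRC macro under the assumptions above."""
    return DIRC_CELLS_PER_MACRO * RERAM_DEVICES_PER_CELL * BITS_PER_RERAM_DEVICE


def total_capacity_bytes():
    """Bytes stored across all cores under the assumptions above."""
    return macro_capacity_bits() * MACROS_PER_CORE * NUM_CORES // 8


def effective_scan_bandwidth():
    """Bytes per second if the whole stored capacity is scanned per query."""
    return total_capacity_bytes() / RETRIEVAL_LATENCY_S


def energy_per_bit_joules():
    """Energy per stored bit scanned, again assuming a full scan per query."""
    return ENERGY_PER_QUERY_J / (total_capacity_bytes() * 8)


print(total_capacity_bytes())                    # 4194304 bytes = 4 MiB
print(f"{effective_scan_bandwidth() / 1e9:.0f} GB/s")
print(f"{energy_per_bit_joules() * 1e15:.1f} fJ/bit")
```

These are back-of-the-envelope numbers derived from the summary figures, not results reported by the paper; changing either assumption changes the totals proportionally.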