A Memory-Efficient Retrieval Architecture for RAG-Enabled Wearable Medical LLMs-Agents

Zhipeng Liao, Kunming Shao, Jiangnan Yu, Liang Zhao, Tim Kwang-Ting Cheng, Chi-Ying Tsui, Jie Yang, Mohamad Sawan

TL;DR

This work tackles privacy-preserving RAG on resource-constrained wearables by introducing a quantization-aware, two-stage hierarchical retrieval scheme. The approach uses MSB INT4-based approximate retrieval followed by INT8 full-precision re-ranking, aided by a bit-planar storage scheme and a query-stationary dataflow, to substantially cut DRAM accesses and energy while preserving retrieval accuracy. Hardware evaluations on a 28 nm design show memory access reductions of about 50% and computation reductions of about 75%, with a per-query energy of 177.76 μJ for a 1 MB database; BEIR-based software tests confirm near-INT8 accuracy. Overall, the architecture offers a practical, energy-efficient solution for edge RAG-enabled wearable medical LLM agents, enabling private data usage without cloud retraining or fine-tuning.

Abstract

With powerful and integrative large language models (LLMs), medical AI agents have demonstrated unique advantages in providing personalized medical consultations, continuous health monitoring, and precise treatment plans. Retrieval-Augmented Generation (RAG) integrates personal medical documents into LLMs through an external retrievable database, avoiding the costly retraining or fine-tuning otherwise needed to deploy customized agents. While deploying medical agents on edge devices ensures privacy protection, RAG implementations impose substantial memory access and energy consumption during the retrieval stage. This paper presents a hierarchical retrieval architecture for edge RAG, built on a two-stage retrieval scheme: approximate retrieval generates a candidate set, and high-precision retrieval then runs only on the pre-selected document embeddings. The proposed architecture significantly reduces energy consumption and external memory access while maintaining retrieval accuracy. Simulation results show that, under TSMC 28 nm technology, the proposed hierarchical retrieval architecture reduces overall memory access by nearly 50% and computation by 75% compared to pure INT8 retrieval, and that the total energy consumption for 1 MB data retrieval is 177.76 μJ/query.
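The two-stage scheme summarized above can be illustrated with a small software sketch. This is not the authors' hardware design, only a minimal model under stated assumptions: embeddings are signed INT8 vectors, each value is split into a signed 4-bit MSB plane and an unsigned 4-bit LSB plane (a simple stand-in for bit-planar storage), the query is truncated to its MSB plane in stage 1, and similarity is a plain dot product. The function names (`split_bit_planes`, `hierarchical_retrieve`) and parameters (`num_candidates`, `top_k`) are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_bit_planes(value: int) -> Tuple[int, int]:
    """Split a signed INT8 value into (msb, lsb) with value == msb * 16 + lsb.

    msb is a signed 4-bit value in [-8, 7]; lsb is an unsigned nibble in [0, 15].
    """
    if not -128 <= value <= 127:
        raise ValueError("value must fit in signed INT8")
    msb = value >> 4   # arithmetic shift keeps the sign of the high nibble
    lsb = value & 0xF  # low nibble, always non-negative
    return msb, lsb


def dot(a: Sequence[int], b: Sequence[int]) -> int:
    """Integer dot product (assumed similarity metric for this sketch)."""
    return sum(x * y for x, y in zip(a, b))


def hierarchical_retrieve(
    query: Sequence[int],
    docs: Sequence[Sequence[int]],
    num_candidates: int,
    top_k: int,
) -> List[int]:
    """Return indices of the top_k documents using coarse-then-fine scoring."""
    # Stage 1: approximate scores computed from MSB planes only.
    q_msb = [split_bit_planes(v)[0] for v in query]
    coarse = []
    for idx, doc in enumerate(docs):
        d_msb = [split_bit_planes(v)[0] for v in doc]
        coarse.append((dot(q_msb, d_msb), idx))
    coarse.sort(key=lambda pair: (-pair[0], pair[1]))
    candidates = [idx for _, idx in coarse[:num_candidates]]

    # Stage 2: full INT8 re-ranking restricted to the candidate set.
    reranked = sorted(candidates, key=lambda i: (-dot(query, docs[i]), i))
    return reranked[:top_k]


if __name__ == "__main__":
    query = [100, -50, 30, 7]
    docs = [
        [90, -60, 20, 0],
        [-100, 50, -30, 5],
        [10, 5, 120, -8],
        [100, -40, 35, 10],
    ]
    print(hierarchical_retrieve(query, docs, num_candidates=2, top_k=1))
```

In this model, stage 1 reads only the MSB plane of every document, and the full INT8 values are needed only for the few candidates that survive. That is the intuition behind the reported savings in memory access and computation, although the exact figures depend on the hardware details described in the paper.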

Paper Structure

This paper contains 14 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: RAG-enabled medical LLM agent on edge wearable device
  • Figure 2: (a) Architecture of RAG retrieval accelerator with query-stationary dataflow, (b) Stage 1: MSB INT4 approximate retrieval workflow and bit-planar memory strategy, (c) Stage 2: INT8 full-precision retrieval workflow and non-division fraction comparison
  • Figure 3: Structure of the Processing Element (PE)
  • Figure 4: Memory access reduction and computation reduction for different document chunk numbers
  • Figure 5: Comparison of the proposed hierarchical retrieval data format with other data formats. (a) Normalized retrieval precision. (b) Energy consumption per query
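
The caption of Figure 2(c) mentions a non-division fraction comparison, but this summary does not describe how it is carried out. One common way to compare two normalized scores without a divider is cross-multiplication, sketched below. It assumes that each score is a fraction with an integer numerator and a strictly positive integer denominator (for example, a dot product over a precomputed norm); the function name `fraction_greater` is hypothetical, and the paper's actual circuit may differ.

```python
def fraction_greater(num_a: int, den_a: int, num_b: int, den_b: int) -> bool:
    """Return True if num_a / den_a > num_b / den_b, without dividing.

    Cross-multiplication preserves the ordering only when both denominators
    are strictly positive, so that precondition is checked explicitly.
    """
    if den_a <= 0 or den_b <= 0:
        raise ValueError("denominators must be strictly positive")
    return num_a * den_b > num_b * den_a


if __name__ == "__main__":
    # Compares 3/4 against 2/3 using two multiplications and one comparison.
    print(fraction_greater(3, 4, 2, 3))
```

Replacing each division with two integer multiplications keeps the comparison exact and avoids a hardware divider, which is usually more expensive than a multiplier.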