Robust Implementation of Retrieval-Augmented Generation on Edge-based Computing-in-Memory Architectures

Ruiyang Qin, Zheyu Yan, Dewen Zeng, Zhenge Jia, Dancheng Liu, Jianbo Liu, Zhi Zheng, Ningyuan Cao, Kai Ni, Jinjun Xiong, Yiyu Shi

TL;DR

This work tackles the latency and scalability challenges of retrieval-augmented generation on edge devices by introducing RoCR, a Robust CiM-backed RAG framework. RoCR combines contrastive learning, triplet-based data construction (CDE/CDI), and flexible noise-aware training to produce CiM-friendly embeddings that withstand non-idealities in in-memory computations. The approach is shown to outperform baselines across five CiM devices and multiple LLMs, achieving significant gains (up to ~35%) and approaching ideal RAG performance while maintaining end-to-end edge viability. By enabling fast, scalable in-situ retrieval on edge hardware, RoCR paves the way for private, personalized LLMs operating without heavy parameter updates.

Abstract

Large Language Models (LLMs) deployed on edge devices learn through fine-tuning and updating a certain portion of their parameters. Although such learning methods can be optimized to reduce resource utilization, the overall required resources remain a heavy burden on edge devices. Instead, Retrieval-Augmented Generation (RAG), a resource-efficient LLM learning method, can improve the quality of the LLM-generated content without updating model parameters. However, the RAG-based LLM may involve repetitive searches on the profile data in every user-LLM interaction. This search can lead to significant latency along with the accumulation of user data. Conventional efforts to decrease latency result in restricting the size of saved user data, thus reducing the scalability of RAG as user data continuously grows. It remains an open question: how to free RAG from the constraints of latency and scalability on edge devices? In this paper, we propose a novel framework to accelerate RAG via Computing-in-Memory (CiM) architectures. It accelerates matrix multiplications by performing in-situ computation inside the memory while avoiding the expensive data transfer between the computing unit and memory. Our framework, Robust CiM-backed RAG (RoCR), utilizing a novel contrastive learning-based training method and noise-aware training, can enable RAG to efficiently search profile data with CiM. To the best of our knowledge, this is the first work utilizing CiM to accelerate RAG.

Paper Structure (19 sections, 12 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: The workflow of RAG on edge-based CiM. CiM performs max inner product search (MIPS) to retrieve the top-ranked documents, which are concatenated with the user query so that the LLM can generate personalized responses.
  • Figure 2: The impact on MIPS accuracy when the RAG's document embedding is perturbed by various levels of Gaussian noise caused by device variations. A retrieval counts as accurate when the document retrieved under noise is the same as the one retrieved without noise.
  • Figure 3: Overview of the proposed Robust CiM-backed RAG framework (RoCR). It optimizes the sentence embedding model to adapt to the different types of NVMs utilized by CiM.
  • Figure 4: Improvement by our Robust CiM-backed RAG. Our framework generates noise-resilient embeddings, as shown by the orange and blue points in the right subfigure.
  • Figure 5: Examples of the two data construction methods. For data with explicit labels, CDE is used to construct the training data. For data without explicit labels (implicitly labeled data), CDI is used to construct the training data.
  • ...and 2 more figures
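
The captions of Figures 1 and 2 describe two ideas that a small example may make concrete: max inner product search (MIPS) over document embeddings, and the accuracy criterion under which a noisy retrieval counts as correct only if it returns the same document as the noise-free retrieval. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`mips_top_k`, `add_gaussian_noise`, `retrieval_accuracy`), the random toy embeddings, and the choice of additive zero-mean Gaussian noise with a single `sigma` are all illustrative, and real CiM device variation models may differ.

```python
import random


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mips_top_k(query, docs, k=1):
    """Return indices of the k documents with the largest inner product."""
    scores = [inner_product(query, d) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def add_gaussian_noise(docs, sigma, rng):
    """Perturb every embedding value with zero-mean Gaussian noise.

    Assumption: additive i.i.d. noise stands in for device variation.
    """
    return [[x + rng.gauss(0.0, sigma) for x in d] for d in docs]


def retrieval_accuracy(queries, docs, sigma, trials=20, seed=0):
    """Fraction of noisy top-1 retrievals that match the noise-free top-1."""
    rng = random.Random(seed)
    clean = [mips_top_k(q, docs)[0] for q in queries]
    hits = 0
    for _ in range(trials):
        noisy_docs = add_gaussian_noise(docs, sigma, rng)
        for q, expected in zip(queries, clean):
            if mips_top_k(q, noisy_docs)[0] == expected:
                hits += 1
    return hits / (trials * len(queries))


if __name__ == "__main__":
    # Toy data: random embeddings, purely for illustration.
    rng = random.Random(42)
    dim, n_docs, n_queries = 16, 50, 20
    docs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_docs)]
    queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_queries)]
    for sigma in (0.0, 0.1, 0.5, 1.0):
        acc = retrieval_accuracy(queries, docs, sigma)
        print(f"sigma={sigma:.1f}  accuracy={acc:.2f}")
```

With `sigma=0.0` the noisy and clean retrievals are identical, so the accuracy is 1.0 by construction. Larger `sigma` values generally lower it, which is the kind of degradation the noise-resilient embeddings are meant to reduce.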