EdgeRAG: Online-Indexed RAG for Edge Devices

Korakit Seemakhupt, Sihang Liu, Samira Khan

TL;DR

EdgeRAG tackles the memory constraints of running Retrieval Augmented Generation (RAG) on edge devices by pruning second-level embeddings, generating them on demand at retrieval time, and adaptively caching generated embeddings. It uses a two-level IVF-like index that keeps pre-computed embeddings only for the large tail clusters that are costly to regenerate, plus a dynamic cache for the rest, to balance memory footprint against latency. Evaluated on BEIR benchmarks on a Jetson Orin Nano, EdgeRAG improves time-to-first-token (TTFT) by around 1.8x on average (3.82x on large datasets), keeps generation quality within ~5% of the Flat baseline, and fits every evaluated dataset into memory. This makes low-latency, RAG-enabled mobile assistants over local data practical while avoiding memory thrashing.

Abstract

Deploying Retrieval Augmented Generation (RAG) on resource-constrained edge devices is challenging due to limited memory and processing power. In this work, we propose EdgeRAG, which addresses the memory constraint by pruning embeddings within clusters and generating embeddings on demand during retrieval. To avoid the latency of generating embeddings for large tail clusters, EdgeRAG pre-computes and stores embeddings for these clusters, while adaptively caching the remaining embeddings to minimize redundant computation and further reduce latency. Results on the BEIR suite show that EdgeRAG offers a significant latency reduction over the baseline IVF index while maintaining similar generation quality and allowing all of our evaluated datasets to fit into memory.
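
The retrieval flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_embed`, `OnlineIndex`, `build_index`, `store_threshold`, and `cache_size` are hypothetical names; the bag-of-characters embedding stands in for a real embedding model; cluster size is used as a simple proxy for regeneration cost; and a plain LRU cache is swapped in for the paper's adaptive caching policy.

```python
from collections import OrderedDict

import numpy as np


def toy_embed(texts, dim=16):
    """Hypothetical stand-in for an embedding model: normalized character counts."""
    out = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for ch in text.lower():
            out[i, ord(ch) % dim] += 1.0
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-9)


class OnlineIndex:
    """Two-level index: centroids stay in memory, most second-level embeddings do not."""

    def __init__(self, centroids, clusters, store_threshold, cache_size, embed=toy_embed):
        self.centroids = centroids      # first level, one row per cluster id
        self.clusters = clusters        # cluster id -> list of text chunks
        self.embed = embed
        # Assumption: cluster size approximates how costly regeneration would be,
        # so only clusters larger than the threshold keep pre-computed embeddings.
        self.stored = {
            cid: embed(chunks)
            for cid, chunks in clusters.items()
            if len(chunks) > store_threshold
        }
        self.cache = OrderedDict()      # LRU cache for on-demand embeddings
        self.cache_size = cache_size
        self.generated = 0              # number of on-demand generations so far

    def _cluster_embeddings(self, cid):
        if cid in self.stored:          # pre-computed large cluster
            return self.stored[cid]
        if cid in self.cache:           # recently generated, reuse it
            self.cache.move_to_end(cid)
            return self.cache[cid]
        emb = self.embed(self.clusters[cid])    # pruned cluster: regenerate now
        self.generated += 1
        self.cache[cid] = emb
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)      # evict least recently used entry
        return emb

    def search(self, query, nprobe=1, k=1):
        q = self.embed([query])[0]
        # First level: pick the nprobe clusters whose centroids best match the query.
        probe = np.argsort(-(self.centroids @ q))[:nprobe]
        scored = []
        for cid in probe:
            cid = int(cid)
            sims = self._cluster_embeddings(cid) @ q
            scored.extend(zip(sims.tolist(), self.clusters[cid]))
        scored.sort(key=lambda pair: -pair[0])
        return [text for _, text in scored[:k]]


def build_index(clusters, store_threshold, cache_size, embed=toy_embed):
    """Assumes cluster ids are 0..n-1 so they line up with centroid rows."""
    centroids = np.stack([embed(clusters[cid]).mean(axis=0) for cid in sorted(clusters)])
    return OnlineIndex(centroids, clusters, store_threshold, cache_size, embed)
```

In this sketch, a query that lands in a small cluster triggers one embedding generation on first access and a cache hit on repeat access, while a query that lands in a large cluster reads the stored embeddings directly. The `generated` counter makes that difference observable.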

Paper Structure

This paper contains 26 sections, 13 figures, 4 tables, 3 algorithms.

Figures (13)

  • Figure 1: RAG Pipelines.
  • Figure 2: Retrieval process of Inverted File Index
  • Figure 3: RAG latency breakdown and embedded database size
  • Figure 4: Embedding Generation Rate of different cluster size
  • Figure 5: Cluster Embedding Generation Cost of nq dataset
  • ...and 8 more figures