Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance

Shangyu Liu, Zhenzhe Zheng, Xiaoyao Huang, Fan Wu, Guihai Chen, Jie Wu

TL;DR

DRAGON addresses the privacy and latency challenges of retrieval-augmented generation by distributing knowledge retrieval and generation across on-device and cloud resources. The core idea is Speculative Aggregation, a dual-side draft-then-verify mechanism that decouples aggregation from sequential decoding so that transmission and computation can overlap. An adaptive Greedy Scheduling module chooses the aggregation side in real time to maximize this overlap under changing network conditions. Empirical evaluation on a real hardware testbed with representative SLMs and large retrieval corpora shows substantial gains over both device-only and centralized RAG, including up to 1.9x greater gains over the standalone SLM than centralized RAG achieves and large reductions in per-token latency, while preserving privacy and keeping TTFT overhead minimal.

Abstract

Small language models (SLMs) support efficient deployments on resource-constrained edge devices, but their limited capacity compromises inference performance. Retrieval-augmented generation (RAG) is a promising solution to enhance model performance by integrating external databases, without requiring intensive on-device model retraining. However, large-scale public databases and user-specific private contextual documents are typically located on the cloud and the device separately, while existing RAG implementations are primarily centralized. To bridge this gap, we propose DRAGON, a distributed RAG framework to enhance on-device SLMs through both general and personal knowledge without the risk of leaking document privacy. Specifically, DRAGON decomposes multi-document RAG into multiple parallel token generation processes performed independently and locally on the cloud and the device, and employs a newly designed Speculative Aggregation, a dual-side speculative algorithm that avoids frequent output synchronization between the cloud and the device. A new scheduling algorithm is further introduced to identify the optimal aggregation side based on real-time network conditions. Evaluations on a real-world hardware testbed demonstrate a significant performance improvement for DRAGON: up to 1.9x greater gains over the standalone SLM than centralized RAG achieves, a substantial reduction in per-token latency, and negligible Time to First Token (TTFT) overhead.
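To make the draft-then-verify idea concrete, the following is a minimal Python sketch, not the paper's algorithm. It assumes each side (device and cloud) produces a next-token distribution from its own documents, that the two are combined as a weighted mixture, and that a locally drafted token is accepted only if it matches the greedy choice under the mixture. The function names, the weights `w_device` and `w_cloud`, and the greedy acceptance rule are all illustrative assumptions; the abstract does not specify them.

```python
from typing import Dict, Tuple

Dist = Dict[str, float]  # token -> probability


def aggregate(p_device: Dist, p_cloud: Dist,
              w_device: float, w_cloud: float) -> Dist:
    """Weighted mixture of two next-token distributions.

    The weights are hypothetical stand-ins for per-side relevance scores.
    """
    total = w_device + w_cloud
    vocab = set(p_device) | set(p_cloud)
    return {
        tok: (w_device * p_device.get(tok, 0.0)
              + w_cloud * p_cloud.get(tok, 0.0)) / total
        for tok in vocab
    }


def greedy_token(p: Dist) -> str:
    """Most probable token; ties resolve to the alphabetically first token."""
    return max(sorted(p), key=p.get)


def verify_draft(draft: str, p_device: Dist, p_cloud: Dist,
                 w_device: float, w_cloud: float) -> Tuple[bool, str]:
    """Accept the draft only if it equals the aggregated greedy token.

    Simplified greedy check (an assumption), returning (accepted, target).
    """
    target = greedy_token(aggregate(p_device, p_cloud, w_device, w_cloud))
    return draft == target, target


# Toy example: the device drafts from its own distribution, then the
# draft is checked once the cloud-side distribution is available.
p_device = {"cat": 0.6, "dog": 0.4}
p_cloud = {"cat": 0.2, "dog": 0.8}
draft = greedy_token(p_device)
accepted, target = verify_draft(draft, p_device, p_cloud, 1.0, 1.0)
```

In this toy setup the draft is kept when both sides agree on the aggregated token and replaced by `target` otherwise. The point of the sketch is only that drafting can proceed locally while the other side's distribution is still in transit, which is the overlap of transmission and computation that the abstract describes.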

Paper Structure

This paper contains 25 sections, 5 theorems, 19 equations, 13 figures, 1 table, 1 algorithm.

Key Result

Theorem 4.2

Given any distributions $\bm p^l_t$ and $\bm p^r_t$, when $\eta^r_t$ is fixed, maximizing $\alpha^l_t$ is equivalent to maximizing $\gamma^l$.

Figures (13)

  • Figure 1: Comparison between different RAG architectures.
  • Figure 2: Overview of the DRAGON framework.
  • Figure 3: Difference in per-token latency between side $l$ and side $r$ performing aggregation, versus varying $l$-side decoding latency.
  • Figure 4: Theoretical speedup of DRAGON compared to the vanilla distributed RAG vs. varying $c^l_\text{dec}$, $c^r_\text{dec}$, $\text{rtt}$ and $\alpha^r_t$.
  • Figure 5: Performance on WikiText.
  • ...and 8 more figures

Theorems & Definitions (7)

  • Definition 4.1
  • Theorem 4.2
  • Definition 6.1
  • Theorem 6.2
  • Corollary 6.3
  • Corollary 6.4
  • Corollary 6.5