CoEdge-RAG: Optimizing Hierarchical Scheduling for Retrieval-Augmented LLMs in Collaborative Edge Computing

Guihang Hong, Tao Ouyang, Kongyange Zhao, Zhi Zhou, Xu Chen

TL;DR

CoEdge-RAG addresses the challenge of serving real-time, privacy-preserving retrieval-augmented LLMs across distributed edge nodes. It combines a PPO-based online query identifier, a capacity-aware inter-node scheduler, and an intra-node scheduler based on online convex optimization to jointly optimize latency and generation quality under heterogeneous resources. On QA benchmarks, the approach improves performance by 4.23% to 91.39% over baseline methods, and it handles spatiotemporal query skew and privacy constraints. By exploiting distributed data and heterogeneous compute, CoEdge-RAG offers a practical framework for high-quality, low-latency LLM serving at the edge.

Abstract

Motivated by the imperative for real-time responsiveness and data privacy preservation, large language models (LLMs) are increasingly deployed on resource-constrained edge devices to enable localized inference. To improve output quality, retrieval-augmented generation (RAG) is an efficient technique that seamlessly integrates local data into LLMs. However, existing edge computing paradigms primarily focus on single-node optimization, neglecting opportunities to holistically exploit distributed data and heterogeneous resources through cross-node collaboration. To bridge this gap, we propose CoEdge-RAG, a hierarchical scheduling framework for retrieval-augmented LLMs in collaborative edge computing. In general, privacy constraints preclude accurate a priori acquisition of heterogeneous data distributions across edge nodes, directly impeding RAG performance optimization. Thus, we first design an online query identification mechanism using proximal policy optimization (PPO), which autonomously infers query semantics and establishes cross-domain knowledge associations in an online manner. Second, we devise a dynamic inter-node scheduling strategy that balances workloads across heterogeneous edge nodes by synergizing historical performance analytics with real-time resource thresholds. Third, we develop an intra-node scheduler based on online convex optimization, adaptively allocating query processing ratios and memory resources to optimize the latency-quality trade-off under fluctuating assigned loads. Comprehensive evaluations across diverse QA benchmarks demonstrate that our proposed method significantly boosts the performance of collaborative retrieval-augmented LLMs, achieving performance gains of 4.23% to 91.39% over baseline methods across all tasks.

Paper Structure

This paper contains 20 sections, 17 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Generation quality comparison.
  • Figure 2: Latency comparison with different skewnesses.
  • Figure 3: Generation quality and latency performances.
  • Figure 4: An illustration of CoEdge-RAG.
  • Figure 5: Generation quality of different scheduling strategies.
  • ...and 1 more figure