RAGDoll: Efficient Offloading-based Online RAG System on a Single GPU

Weiping Yu, Ningyi Liao, Siqiang Luo, Junfeng Liu

TL;DR

RAGDoll targets Retrieval-Augmented Generation (RAG) serving on a single consumer-grade GPU by jointly optimizing memory placement and a parallel, multi-pipeline RAG workflow. It decouples retrieval (CPU-side vector search over a memory hierarchy) from generation (GPU), uses dynamic prefetching and backlog-aware batch scheduling, and adapts its configuration online through active profiling. Together, hierarchical memory management, continuous prefetching, and adaptive scheduling reduce idle time and balance the two workloads, yielding up to a 3.6x speedup in average latency over serial vLLM-based RAG systems while remaining resilient under dynamic workloads. The results suggest that memory-intensive RAG systems can be deployed practically on modest hardware, making LLM-powered applications more accessible in resource-constrained environments.

Abstract

Retrieval-Augmented Generation (RAG) enhances large language model (LLM) generation quality by incorporating relevant external knowledge. However, deploying RAG on consumer-grade platforms is challenging due to limited memory and the increasing scale of both models and knowledge bases. In this work, we introduce RAGDoll, a resource-efficient, self-adaptive RAG serving system integrated with LLMs, specifically designed for resource-constrained platforms. RAGDoll exploits the insight that RAG retrieval and LLM generation impose different computational and memory demands, which in a traditional serial workflow result in substantial idle times and poor resource utilization. Based on this insight, RAGDoll decouples retrieval and generation into parallel pipelines, incorporating joint memory placement and dynamic batch scheduling strategies to optimize resource usage across diverse hardware devices and workloads. Extensive experiments demonstrate that RAGDoll adapts effectively to various hardware configurations and LLM scales, achieving up to 3.6 times speedup in average latency compared to serial RAG systems based on vLLM.
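
The decoupling described in the abstract can be illustrated with a small producer/consumer sketch. This is not RAGDoll's implementation: `retrieve`, `generate`, `pick_batch_size`, and `serve` are hypothetical names, the retrieval and generation steps are string stand-ins, and the batch-size rule (take whatever is waiting, clamped to a maximum) is an assumed simplification of backlog-aware scheduling. It only shows how a retrieval worker and a generation worker can run concurrently, with the generation batch sized from the current backlog instead of a fixed value.

```python
import queue
import threading


def retrieve(query):
    # Stand-in for CPU-side vector search (hypothetical helper).
    return f"context for {query}"


def generate(batch):
    # Stand-in for GPU-side LLM generation over one batch (hypothetical helper).
    return [f"answer({q} | {ctx})" for q, ctx in batch]


def pick_batch_size(backlog, min_batch=1, max_batch=8):
    # Assumed backlog-aware rule, not the paper's policy:
    # take everything that is waiting, clamped to [min_batch, max_batch].
    return max(min_batch, min(max_batch, backlog))


def serve(queries, max_batch=8):
    ready = queue.Queue()  # retrieved requests waiting for generation
    results = []           # generated answers, in arrival order
    batch_sizes = []       # batch size chosen for each generation step
    done = object()        # sentinel marking the end of the request stream

    def retrieval_worker():
        # Retrieval pipeline: runs independently of generation.
        for q in queries:
            ready.put((q, retrieve(q)))
        ready.put(done)

    def generation_worker():
        # Generation pipeline: drains the backlog in variable-size batches.
        finished = False
        while not finished:
            item = ready.get()  # block until at least one request is ready
            if item is done:
                break
            batch = [item]
            target = pick_batch_size(ready.qsize() + 1, max_batch=max_batch)
            while len(batch) < target:
                try:
                    nxt = ready.get_nowait()
                except queue.Empty:
                    break
                if nxt is done:
                    finished = True
                    break
                batch.append(nxt)
            batch_sizes.append(len(batch))
            results.extend(generate(batch))

    workers = [
        threading.Thread(target=retrieval_worker),
        threading.Thread(target=generation_worker),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results, batch_sizes


if __name__ == "__main__":
    answers, sizes = serve([f"q{i}" for i in range(20)], max_batch=4)
    print(len(answers), "answers; batch sizes:", sizes)
```

The batch sizes printed by the usage example depend on thread timing, so they vary between runs; only the total number of answers and the upper bound on each batch are fixed. A real system would also have to account for memory placement and prefetching, which this sketch leaves out entirely.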

Paper Structure

This paper contains 22 sections, 7 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Representative techniques related to RAG serving, depicted by corresponding tasks and design objectives.
  • Figure 2: Pipelines in memory-intense RAG systems: (a) Standard overlapping LLM inference may misalign computation and prefetching due to CPU scheduling and compute jitter. (b) Our LLM pipeline separates computation and communication for continuous prefetching. (c) Fixed batch scheduling accumulates larger backlogs under memory-intense conditions. (d) Our backlog-aware batch scheduling adjusts flexibly to minimize backlogs.
  • Figure 3: Dissecting an online offloading-based RAG system. (a) LLM tensor placement. (b) Vector database residents. (c) Compute workspace scheduling.
  • Figure 4: CPU and GPU utilization and memory usage vary in a serial retrieval and generation mode when using different batch sizes under a static memory allocation policy.
  • Figure 5: Overview of RAGDoll.
  • ...and 6 more figures