PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design

Wenqi Jiang, Shuai Zhang, Boran Han, Jie Wang, Bernie Wang, Tim Kraska

TL;DR

PipeRAG addresses latency and quality bottlenecks in retrieval-augmented generation (RAG) by co-designing the algorithm and the system. It introduces pipeline parallelism, flexible retrieval intervals, and a performance-model-driven retrieval policy to overlap retrieval with inference and to adapt retrieval effort to hardware and context. Empirical results show up to a 2.6× speedup in end-to-end generation latency together with perplexity improvements on large databases, outperforming Retro across multiple datasets. The work demonstrates the practical potential of algorithm-system co-design to unlock efficient, high-quality RAG in future systems.

Abstract

Retrieval-augmented generation (RAG) can enhance the generation quality of large language models (LLMs) by incorporating external token databases. However, retrievals from large databases can constitute a substantial portion of the overall generation time, particularly when retrievals are periodically performed to align the retrieved content with the latest states of generation. In this paper, we introduce PipeRAG, a novel algorithm-system co-design approach to reduce generation latency and enhance generation quality. PipeRAG integrates (1) pipeline parallelism to enable concurrent retrieval and generation processes, (2) flexible retrieval intervals to maximize the efficiency of pipeline parallelism, and (3) a performance model to automatically balance retrieval quality and latency based on the generation states and underlying hardware. Our evaluation shows that, by combining the three aforementioned methods, PipeRAG achieves up to 2.6$\times$ speedup in end-to-end generation latency while improving generation quality. These promising results showcase the effectiveness of co-designing algorithms with underlying systems, paving the way for the adoption of PipeRAG in future RAG systems.
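
To make the overlap argument concrete, the following is a minimal sketch of a toy latency model, not the paper's actual performance model. It assumes every retrieval interval has the same generation latency and the same retrieval latency, and that only the first retrieval cannot be hidden behind generation. The names `Workload`, `sequential_latency`, `pipelined_latency`, and `pick_nprobe` are hypothetical, and all latency numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Workload:
    """Hypothetical description of one generation run (illustrative only)."""
    num_intervals: int        # number of retrieval intervals in the run
    gen_latency: float        # seconds to generate the tokens of one interval
    retrieval_latency: float  # seconds for one database retrieval


def sequential_latency(w: Workload) -> float:
    """Retrieval blocks generation, so every interval pays both costs."""
    return w.num_intervals * (w.gen_latency + w.retrieval_latency)


def pipelined_latency(w: Workload) -> float:
    """Retrieval for the next interval overlaps generation of the current one.

    The first retrieval is exposed. Each of the next (n - 1) steps costs
    whichever stage is slower, and the final interval only needs generation.
    """
    slower_stage = max(w.gen_latency, w.retrieval_latency)
    return (w.retrieval_latency
            + (w.num_intervals - 1) * slower_stage
            + w.gen_latency)


def pick_nprobe(predicted_latency: Dict[int, float], gen_latency: float) -> int:
    """Pick the largest search space whose predicted retrieval latency can be
    hidden behind generation; fall back to the smallest one otherwise.

    `predicted_latency` maps an nprobe value to an assumed retrieval latency.
    """
    hidden = [n for n, t in predicted_latency.items() if t <= gen_latency]
    return max(hidden) if hidden else min(predicted_latency)


if __name__ == "__main__":
    w = Workload(num_intervals=8, gen_latency=0.05, retrieval_latency=0.04)
    print("sequential:", sequential_latency(w))
    print("pipelined: ", pipelined_latency(w))
    print("nprobe:    ", pick_nprobe({1: 0.01, 8: 0.03, 64: 0.2}, w.gen_latency))
```

Under these assumptions, pipelining helps most when retrieval and generation latencies are close, which is why a policy that sizes the search space to fit within the generation time of an interval is a natural companion to the pipeline.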

Paper Structure

This paper contains 21 sections, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: Based on three performance-centric observations (O1$\sim$O3), PipeRAG combines a system-aware algorithm integrating pipeline parallelism (S1) with flexible retrieval intervals (S2) and an algorithm-aware retrieval system guided by a performance model (S3).
  • Figure 2: Retrieval-augmented generation with Retro.
  • Figure 3: Attention mechanisms and query windows in PipeRAG.
  • Figure 4: The effect of database sizes and retrieval strategies on language modeling perplexity (lower perplexity means higher quality).
  • Figure 5: Perplexity of retrieval-augmented generation when applying various retrieval intervals and search space configurations ($nprobe$).
  • ...and 5 more figures