METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation

Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Shaoting Feng, Ganesh Ananthanarayanan, Ravi Netravali, Junchen Jiang

TL;DR

METIS targets the core RAG bottleneck: the tradeoff between response quality and latency. It introduces a two-stage, LLM-guided approach that first profiles each query to prune the configuration space, and then jointly schedules the query and selects its RAG configuration based on the GPU memory currently available. This reduces latency by $1.64$–$2.54\times$ and raises throughput without sacrificing quality. The framework demonstrates that adapting three knobs per query (synthesis strategy, chunk count, and summary length), when coupled with resource-aware scheduling, yields significant gains across four RAG-QA datasets. A lightweight profiler and memory-aware best-fit scheduling keep the overhead negligible, which makes deployment practical. The approach advances practical, scalable RAG systems by enabling per-query optimization over the entire configuration space under real-time resource constraints.
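The two stages summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `RagConfig`, `pruned_space`, `est_tokens`, and `best_fit`, the synthesis labels, the 512-token chunk size, and the token-count memory proxy are all assumptions made for illustration. Stage 1 is represented only by its output (a reduced set of knob values), and stage 2 picks the largest configuration in that set that still fits in free memory.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, Iterator, Optional, Sequence, Tuple


@dataclass(frozen=True)
class RagConfig:
    synthesis: str    # hypothetical labels, e.g. "stuff" or "map_reduce"
    num_chunks: int   # number of retrieved chunks fed to the LLM
    summary_len: int  # tokens per intermediate summary (0 if unused)


def pruned_space(methods: Sequence[str],
                 chunk_range: Tuple[int, int],
                 summary_lens: Sequence[int]) -> Iterator[RagConfig]:
    """Enumerate only the configurations a profiler left in play (stage 1 output)."""
    lo, hi = chunk_range
    for method, k in product(methods, range(lo, hi + 1)):
        # In this toy model only "map_reduce" uses intermediate summaries.
        lens = summary_lens if method == "map_reduce" else [0]
        for s in lens:
            yield RagConfig(method, k, s)


def est_tokens(cfg: RagConfig, chunk_tokens: int = 512) -> int:
    """Crude memory proxy (an assumption): tokens resident at the same time."""
    if cfg.synthesis == "map_reduce":
        # Either one chunk is being summarized or all summaries are being merged.
        return max(chunk_tokens, cfg.num_chunks * cfg.summary_len)
    return cfg.num_chunks * chunk_tokens


def best_fit(space: Iterable[RagConfig], free_tokens: int) -> Optional[RagConfig]:
    """Stage 2: choose the largest configuration that fits in free memory."""
    fitting = [c for c in space if est_tokens(c) <= free_tokens]
    return max(fitting, key=est_tokens, default=None)


if __name__ == "__main__":
    space = list(pruned_space(["stuff", "map_reduce"], (2, 4), [50, 100]))
    print(best_fit(space, free_tokens=1600))
```

With ample free memory the sketch selects a richer configuration; as free memory shrinks it falls back to cheaper ones, and it returns `None` when nothing fits. A real scheduler would need a calibrated memory model rather than this token-count proxy.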

Abstract

RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces the response delay (through better scheduling of RAG queries) or strives to maximize quality (which involves tuning the RAG workflow), but they fall short in optimizing the tradeoff between the delay and quality of RAG responses. This paper presents METIS, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and synthesis methods, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with the state-of-the-art RAG optimization schemes, METIS reduces the generation latency by $1.64$–$2.54\times$ without sacrificing generation quality.

Paper Structure

This paper contains 20 sections, 19 figures, 1 table, 1 algorithm.

Figures (19)

  • Figure 1: Performance of METIS on the KG RAG FinSec dataset compared to the baselines. Full results are shown in the evaluation section.
  • Figure 2: The configuration knobs adapted by METIS are derived from key design choices of RAG systems.
  • Figure 3: Illustration of different RAG synthesis methods, which have various LLM reasoning capabilities.
  • Figure 4: Varying each RAG configuration knob leads to different quality-latency tradeoffs, and these tradeoffs differ across queries (Q1 in green, Q2 in blue, and Q3 in red).
  • Figure 5: Per-query configuration can achieve significantly better quality-delay tradeoffs across queries compared to every fixed configuration choice.
  • ...and 14 more figures