METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation
Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Shaoting Feng, Ganesh Ananthanarayanan, Ravi Netravali, Junchen Jiang
TL;DR
METIS targets the central tradeoff in RAG serving: response quality versus latency. It takes a two-stage, LLM-guided approach: it first profiles each query to prune the configuration space, then jointly schedules the query and selects its RAG configuration based on the GPU memory currently available. Across four RAG-QA datasets, this reduces generation latency by $1.64$–$2.54\times$ and raises throughput without sacrificing quality. The gains come from adapting three knobs per query (synthesis strategy, number of retrieved chunks, and summary length) in combination with resource-aware scheduling. A lightweight profiler and memory-aware best-fit scheduling keep the added overhead negligible, which makes per-query optimization practical under real-time resource constraints.
Abstract
RAG (Retrieval-Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces response delay (through better scheduling of RAG queries) or strives to maximize quality (by tuning the RAG workflow), but neither line of work optimizes the tradeoff between the delay and the quality of RAG responses. This paper presents METIS, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and the synthesis method, to balance quality optimization against response delay reduction. Using four popular RAG-QA datasets, we show that, compared with state-of-the-art RAG optimization schemes, METIS reduces generation latency by $1.64$–$2.54\times$ without sacrificing generation quality.
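To make the per-query selection step more concrete, the following is a minimal sketch of one way a memory-aware best-fit choice over a pruned configuration space could look. It is not the METIS implementation. The names `RAGConfig`, `pruned_space`, `estimate_memory_tokens`, and `best_fit`, the synthesis labels `"stuff"` and `"map_reduce"`, the 512-token chunk size, the toy memory model, and the fallback rule are all assumptions made for illustration. "Best fit" is read here as picking the most resource-hungry configuration that still fits in free GPU memory.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class RAGConfig:
    synthesis: str    # hypothetical label for the synthesis strategy
    num_chunks: int   # number of retrieved chunks
    summary_len: int  # summary length in tokens (0 when not applicable)


def pruned_space(methods: Sequence[str],
                 chunk_range: Tuple[int, int],
                 summary_lens: Sequence[int]) -> List[RAGConfig]:
    """Enumerate the configurations left after an (assumed) profiler
    has narrowed the methods, chunk-count range, and summary lengths."""
    lo, hi = chunk_range
    configs = []
    for method, k in product(methods, range(lo, hi + 1)):
        if method == "map_reduce":
            # Only the summarizing strategy uses the summary-length knob.
            configs.extend(RAGConfig(method, k, s) for s in summary_lens)
        else:
            configs.append(RAGConfig(method, k, 0))
    return configs


def estimate_memory_tokens(cfg: RAGConfig, chunk_tokens: int = 512) -> int:
    """Toy footprint model (an assumption): the largest number of tokens
    that must sit in GPU memory at once for this configuration."""
    if cfg.synthesis == "stuff":
        # All chunks are concatenated into a single prompt.
        return cfg.num_chunks * chunk_tokens
    # map_reduce: one chunk at a time, then all summaries together.
    return max(chunk_tokens, cfg.num_chunks * cfg.summary_len)


def best_fit(configs: Sequence[RAGConfig], free_tokens: int) -> RAGConfig:
    """Pick the largest-footprint configuration that fits in free memory.
    If none fits, fall back to the cheapest one (illustrative choice)."""
    fitting = [c for c in configs if estimate_memory_tokens(c) <= free_tokens]
    if not fitting:
        return min(configs, key=estimate_memory_tokens)
    return max(fitting, key=estimate_memory_tokens)


if __name__ == "__main__":
    space = pruned_space(["stuff", "map_reduce"], (2, 4), [50, 100])
    print(best_fit(space, free_tokens=1600))
```

In this sketch the choice is re-evaluated per query against whatever `free_tokens` the serving engine reports at that moment, so the same query could receive a different configuration under heavier load. A real system would replace the toy memory model with measurements from its own serving stack.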
