RAG Without the Lag: Interactive Debugging for Retrieval-Augmented Generation Pipelines

Quentin Romero Lauro, Shreya Shankar, Sepanta Zeighami, Aditya Parameswaran

TL;DR

RAG pipelines fuse retrieval with LLM-based generation, but debugging is hindered by intertwined components and costly re-indexing. The authors introduce raggy, a Python library of composable RAG primitives plus an interactive web interface that supports low-latency what-if analysis and a persistent test suite. A formative study with practitioners informs the design goals, and a user study with 12 engineers reveals retrieval-first debugging patterns, iterative sensemaking, and strong interdependencies across components. The study suggests raggy has practical value for rapid experimentation, and the authors derive design implications for future RAG development tools: better systematic evaluation, provenance tracking, and integration into developers' workflows.

Abstract

Retrieval-augmented generation (RAG) pipelines have become the de facto approach for building AI assistants with access to external, domain-specific knowledge. Given a user query, RAG pipelines typically first retrieve (R) relevant information from external sources before invoking a Large Language Model (LLM), augmented (A) with this information, to generate (G) responses. Modern RAG pipelines frequently chain multiple retrieval and generation components, in any order. However, developing effective RAG pipelines is challenging because retrieval and generation components are intertwined, making it hard to identify which component(s) cause errors in the eventual output. The parameters with the greatest impact on output quality often require hours of pre-processing after each change, creating prohibitively slow feedback cycles. To address these challenges, we present RAGGY, a developer tool that combines a Python library of composable RAG primitives with an interactive interface for real-time debugging. We contribute the design and implementation of RAGGY, insights into expert debugging patterns through a qualitative study with 12 engineers, and design implications for future RAG tools that better align with developers' natural workflows.
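
The retrieve, augment, generate sequence described in the abstract can be illustrated with a minimal sketch. This is not RAGGY's API: every name here (`chunk`, `overlap`, `retrieve`, `augment`, `rag`, `stub_llm`) is hypothetical, the word-overlap scorer stands in for a real retriever, and the stub stands in for a real LLM call. Returning the intermediate outputs alongside the answer hints at why inspecting each stage separately helps when locating the component responsible for a bad output.

```python
from collections import Counter
from typing import Callable, Dict, List


def chunk(text: str, size: int) -> List[str]:
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap(query: str, passage: str) -> int:
    """Toy relevance score (an assumption): count of shared lowercase words."""
    q = Counter(query.lower().split())
    p = Counter(passage.lower().split())
    return sum((q & p).values())


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """R: return the k chunks with the highest overlap score."""
    return sorted(chunks, key=lambda c: overlap(query, c), reverse=True)[:k]


def augment(query: str, retrieved: List[str]) -> str:
    """A: build a prompt that places the retrieved chunks before the question."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; reports only the prompt length."""
    return f"stub answer ({len(prompt)} prompt chars)"


def rag(query: str, chunks: List[str], llm: Callable[[str], str],
        k: int = 2) -> Dict[str, object]:
    """Run R, A, G in order and keep every intermediate output for inspection."""
    retrieved = retrieve(query, chunks, k)
    prompt = augment(query, retrieved)
    return {"retrieved": retrieved, "prompt": prompt, "answer": llm(prompt)}
```

In this sketch, changing `size` in `chunk` changes what `retrieve` can return, which in turn changes the prompt the model sees; that dependency is the kind of interaction between components the abstract describes.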

Paper Structure

This paper contains 28 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Python implementation of a RAG pipeline (left) and the corresponding debugging interface generated by raggy (right). raggy dynamically generates cells for each primitive component: (A) Query, (B) Retriever, (C) LLM, and (D) Answer, enabling developers to view and manipulate intermediate outputs (e.g., chunks (B2, B3), LLM responses), tweak parameters (e.g., chunk size (B1), prompts (C1)), and build a ground-truth dataset by saving answers (D1).
  • Figure 2: In raggy's Retriever cell, users can (A) adjust hyper-parameters (e.g., retrieval method, chunk size), (B) manually select or deselect chunks, and (C) visualize the distribution of selected vs. unselected chunks.
  • Figure 3: In raggy's LLM cell, users can (1) experiment with prompting strategies and (2) directly modify the LLM’s output to test downstream effects. This example shows query decomposition, where a prompt is broken into sub-questions, each triggering its own retrieval.
  • Figure 4: Answer interface. The Answer cell compares a current pipeline result against a previously saved "golden" reference. With raggy, users can save and edit answers (1) and later check the similarity of new pipeline results against their saved ground truth (3).
  • Figure 5: The iterative debugging workflow supported by raggy. Users can (1) run the pipeline, (2) inspect the answer, and (3) backtrack if it’s incorrect. They can then (4a) debug the LLM by editing prompts or parameters, or (4b) debug the retriever by examining retrieved chunks and tuning retrieval settings (e.g., chunk size, method). After testing changes locally and downstream (ii), users can (5) update code and (6) save correct answers to iteratively build an evaluation set. Finally, users can repeat steps 1-6 as needed to refine the pipeline.
  • ...and 1 more figure
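
The save-then-compare behavior described in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions, not raggy's implementation: the `GoldenSet` class, its methods, and the 0.8 threshold are hypothetical, and the standard library's `difflib.SequenceMatcher` character-level ratio stands in for whatever similarity measure the tool actually uses.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in metric (assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class GoldenSet:
    """Hypothetical store of saved 'golden' answers, keyed by query."""

    def __init__(self) -> None:
        self._answers: Dict[str, str] = {}

    def save(self, query: str, answer: str) -> None:
        """Save (or overwrite) the reference answer for a query."""
        self._answers[query] = answer

    def check(self, query: str, new_answer: str,
              threshold: float = 0.8) -> Optional[Dict[str, object]]:
        """Compare a new pipeline answer with the saved one.

        Returns None when no reference has been saved for the query.
        """
        golden = self._answers.get(query)
        if golden is None:
            return None
        score = similarity(golden, new_answer)
        return {"score": score, "passed": score >= threshold}
```

A developer could call `save` once an answer looks right, then call `check` after each pipeline change to see whether previously correct queries have drifted from their saved references.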