Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG

Ilias Triantafyllopoulos, Renyi Qu, Salvatore Giorgi, Brenda Curtis, Lyle H. Ungar, João Sedoc

TL;DR

This paper tackles safety in retrieval-augmented generation by introducing a lightweight, KB-aligned OOD detector that gates questions not supported by the KB. It leverages PCA on KB embeddings to form a compact subspace, selecting principal components via explained variance or a separability-driven score, and evaluates three geometric detectors plus three simple classifiers. Across 16 domains and high-stakes datasets (COVID-19 and Substance Use), the approach achieves competitive OOD detection with far lower latency and cost than LLM-based domain judges, while maintaining interpretability. End-to-end RAG experiments show that abstaining on OOD queries preserves relevance more reliably than chasing perfect correctness, underscoring the practical value of external OOD detection for safe, in-scope AI systems.

Abstract

Retrieval-Augmented Generation (RAG) systems are increasingly deployed in high-stakes domains, where safety depends not only on how a system answers, but also on whether a query should be answered given a knowledge base (KB). Out-of-domain (OOD) queries can cause dense retrieval to surface weakly related context and lead the generator to produce fluent but unjustified responses. We study lightweight, KB-aligned OOD detection as an always-on gate for RAG systems. Our approach applies PCA to KB embeddings and scores queries in a compact subspace selected either by explained-variance retention (EVR) or by a separability-driven t-test ranking. We evaluate geometric semantic-search rules and lightweight classifiers across 16 domains, including high-stakes COVID-19 and Substance Use KBs, and stress-test robustness using both LLM-generated attacks and an in-the-wild 4chan attack. We find that low-dimensional detectors achieve competitive OOD performance while being faster, cheaper, and more interpretable than prompted LLM-based judges. Finally, human and LLM-based evaluations show that OOD queries primarily degrade the relevance of RAG outputs, showing the need for efficient external OOD detection to maintain safe, in-scope behavior.
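
The gating idea in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 0.9 EVR target, the Euclidean metric, and the threshold are all hypothetical choices. The score follows the Figure 3 caption (minimum distance from any KB sample), and only the EVR selection rule is shown; the t-test ranking is omitted.

```python
import numpy as np

def fit_pca_evr(kb_emb, evr_target=0.9):
    """Fit PCA on KB embeddings and keep the fewest components whose
    cumulative explained-variance ratio reaches evr_target (assumed value)."""
    mean = kb_emb.mean(axis=0)
    centered = kb_emb - mean
    # SVD of the centered data: rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var_ratio = s**2 / np.sum(s**2)
    k = int(np.searchsorted(np.cumsum(var_ratio), evr_target)) + 1
    k = min(k, vt.shape[0])  # guard against floating-point overshoot
    return mean, vt[:k]

def project(x, mean, components):
    """Map embedding(s) into the retained PCA subspace."""
    return (x - mean) @ components.T

def min_distance_to_kb(query_emb, kb_proj, mean, components):
    """Euclidean distance from the projected query to its nearest KB sample."""
    q = project(query_emb, mean, components)
    return float(np.min(np.linalg.norm(kb_proj - q, axis=1)))

def is_ood(query_emb, kb_proj, mean, components, threshold):
    """Flag the query as out-of-domain when it lies farther than
    `threshold` (a hypothetical, tuned value) from every KB sample."""
    return min_distance_to_kb(query_emb, kb_proj, mean, components) > threshold
```

In use, `fit_pca_evr` and `project` would run once over the KB embeddings, and `is_ood` would run per query before retrieval. One limitation of this sketch is that a query displaced only along discarded components looks in-domain, since the score is computed entirely inside the retained subspace.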

Paper Structure

This paper contains 27 sections, 2 equations, 28 figures, 35 tables.

Figures (28)

  • Figure 1: The circle denotes the boundary of our knowledge base (the black dot). Everything inside is considered in-domain, while the question outside is classified as out-of-domain.
  • Figure 2: Two-axis view of "when not to answer" for KB-backed assistants. We focus on the vertical axis.
  • Figure 3: Distribution of the distance from the KB. The distance is defined as the minimum distance from any sample of our KB. Blue, In-Domain; Red, Out-Of-Domain; KB, knowledge base.
  • Figure 4: The prompt used to create the LLM-attack datasets with GPT-4o. The generation prompt had two variants to increase the variety of the dataset: the standard prompt, and a modified version with two additional sentences, one encouraging concise questions (the initial outputs tended to be overly verbose) and one favoring statements, to approximate the style of the 4chan dataset. These were run iteratively until more than 515 unique queries, the size of our COVID-19 dataset, had been generated, which yielded a total of 560 queries.
  • Figure 5: The prompt for the rephrasing task, used with GPT-4o to rephrase COVID queries for our second study. The variable "type_of_question" was randomly filled with either "question" or "command-style statement".
  • ...and 23 more figures