ELOQ: Resources for Enhancing LLM Detection of Out-of-Scope Questions

Zhiyuan Peng, Jinming Nian, Alexandre Evfimievski, Yi Fang

TL;DR

ELOQ tackles the risk of hallucinations in retrieval-augmented generation by focusing on out-of-scope questions: questions that appear related to a document but cannot be answered from its contents. It introduces a guided hallucination pipeline that automatically generates out-of-scope questions from post-cutoff news articles, followed by human verification, and uses the resulting data to evaluate and improve detection and defusion of such questions. The work shows that a binary classifier operating on internal representations (unused-token probing) can outperform direct-generation approaches, enabling smaller models to rival larger ones in out-of-scope detection. The dataset and methods offer practical improvements for RAG systems, pointing toward more reliable AI assistants that gracefully handle queries beyond their grounded knowledge.

Abstract

Retrieval-augmented generation (RAG) has become integral to large language models (LLMs), particularly for conversational AI systems where user questions may reference knowledge beyond the LLMs' training cutoff. However, many natural user questions lack well-defined answers, either due to limited domain knowledge or because the retrieval system returns documents that are relevant in appearance but uninformative in content. In such cases, LLMs often produce hallucinated answers without flagging them. While recent work has largely focused on questions with false premises, we study out-of-scope questions, where the retrieved document appears semantically similar to the question but lacks the necessary information to answer it. In this paper, we propose ELOQ, a guided hallucination-based approach that automatically generates a diverse set of out-of-scope questions from post-cutoff documents, followed by human verification to ensure quality. We use this dataset to evaluate several LLMs on their ability to detect out-of-scope questions and generate appropriate responses. Finally, we introduce an improved detection method that enhances the reliability of LLM-based question-answering systems in handling out-of-scope questions.

Paper Structure

This paper contains 36 sections, 1 equation, 2 figures, 7 tables, and 2 algorithms.

Figures (2)

  • Figure 1: Confusion matrix of out-of-scope and defusion on ELOQ-Gold.
  • Figure 2: Evaluation of out-of-scope detection on ELOQ-Gold.