LettuceDetect: A Hallucination Detection Framework for RAG Applications

Ádám Kovács, Gábor Recski

TL;DR

This work tackles extrinsic hallucinations in Retrieval-Augmented Generation (RAG) by introducing LettuceDetect, a token-level hallucination detector built on ModernBERT, which supports long contexts (up to $8{,}192$ tokens), and trained on the RAGTruth benchmark. It frames the task as token classification: for each answer token, predict whether it is supported by the provided context and question. The resulting models are state of the art among encoder-based methods while remaining efficient (approximately 30–60 examples per second on a single GPU) and compact (roughly $150$M parameters for the base model and $396$M for the large one). LettuceDetect surpasses prior encoder-based approaches and most prompt-based ones, although a fine-tuned LLM based on Llama-3-8B still slightly edges it out on example-level metrics. The framework is open-sourced under an MIT license, providing a practical, lightweight tool for real-world RAG deployments and a foundation for extending to more datasets, languages, and architectures.

Abstract

Retrieval-Augmented Generation (RAG) systems remain vulnerable to hallucinated answers despite incorporating external knowledge sources. We present LettuceDetect, a framework that addresses two critical limitations in existing hallucination detection methods: (1) the context window constraints of traditional encoder-based methods, and (2) the computational inefficiency of LLM-based approaches. Building on ModernBERT's extended context capabilities (up to 8k tokens) and trained on the RAGTruth benchmark dataset, our approach outperforms all previous encoder-based models and most prompt-based models, while being approximately 30 times smaller than the best models. LettuceDetect is a token-classification model that processes context-question-answer triples, allowing for the identification of unsupported claims at the token level. Evaluations on the RAGTruth corpus demonstrate an F1 score of 79.22% for example-level detection, which is a 14.8% improvement over Luna, the previous state-of-the-art encoder-based architecture. Additionally, the system can process 30 to 60 examples per second on a single GPU, making it more practical for real-world RAG applications.

Paper Structure

This paper contains 11 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: A web demo of our application built in Streamlit. It features three input fields: question, context, and answer. The output shows the highlighted hallucinated spans.
  • Figure 2: The architecture of LettuceDetect. The figure illustrates an example of a Question, Context, and Answer triplet as input to our architecture. After the tokenization step, the tokens are fed into LettuceDetect for token-level classification. Tokens from both the question and the context are masked (indicated by the red line) so that they are excluded from the loss calculation. In the output of LettuceDetect, we provide a hallucination probability for each answer token. If the output type is span-level, we aggregate consecutive hallucinated tokens into spans.
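
The two mechanisms described in the Figure 2 caption, masking question and context tokens out of the loss and merging adjacent hallucinated tokens into spans, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mask_labels` and `aggregate_spans`, the 0.5 threshold, and the choice to report the maximum token probability as a span's confidence are all assumptions made for illustration. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, which is a common way to exclude positions from a token-classification loss.

```python
from typing import List, Tuple

# Default ignore_index of PyTorch's cross-entropy loss; positions carrying
# this label contribute nothing to the loss.
IGNORE_INDEX = -100


def mask_labels(n_prompt_tokens: int, answer_labels: List[int]) -> List[int]:
    """Build per-token labels for one example (hypothetical helper).

    Question and context tokens get IGNORE_INDEX so that only answer
    tokens (0 = supported, 1 = hallucinated) are scored by the loss.
    """
    return [IGNORE_INDEX] * n_prompt_tokens + list(answer_labels)


def aggregate_spans(
    offsets: List[Tuple[int, int]],
    probs: List[float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """Merge runs of consecutive hallucinated answer tokens into spans.

    offsets: (start, end) character offsets of each answer token.
    probs:   predicted hallucination probability of each answer token.
    Returns (start, end, confidence) per span, where confidence is the
    maximum token probability in the run (an illustrative choice).
    """
    spans: List[Tuple[int, int, float]] = []
    current = None  # [start, end, max_prob] of the run being built
    for (start, end), p in zip(offsets, probs):
        if p >= threshold:
            if current is None:
                current = [start, end, p]
            else:
                current[1] = end
                current[2] = max(current[2], p)
        elif current is not None:
            # A supported token closes the current run.
            spans.append((current[0], current[1], current[2]))
            current = None
    if current is not None:
        spans.append((current[0], current[1], current[2]))
    return spans


if __name__ == "__main__":
    answer = "Paris has 69 million people"
    offsets = [(0, 5), (6, 9), (10, 12), (13, 20), (21, 27)]
    probs = [0.05, 0.10, 0.92, 0.88, 0.20]
    for start, end, conf in aggregate_spans(offsets, probs):
        print(answer[start:end], conf)
```

In this toy example the third and fourth tokens exceed the threshold and are adjacent, so they are reported as a single span covering "69 million" rather than as two separate tokens.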