ORION Grounded in Context: Retrieval-Based Method for Hallucination Detection

Assaf Gerner, Netta Madvil, Nadav Barak, Alex Zaikman, Jonatan Liberman, Liron Hamra, Rotem Brazilay, Shay Tsadok, Yaron Friedman, Neal Harow, Noam Bressler, Shir Chorev, Philip Tannor

TL;DR

The paper tackles hallucinations in grounded generation (RAG and abstractive tasks) by introducing Grounded in Context, a lightweight, retrieval-based detector within the ORION framework, designed for production-scale long-context data. It decomposes each output into discrete factual claims, retrieves supporting context for every claim, uses an encoder-based NLI model to compute an entailment probability $p_{ent}$ for each claim–context pair, and aggregates these probabilities to detect inconsistencies. On the RAGTruth benchmark, Grounded in Context achieves an $F_1$ of $0.83$, which is competitive with larger models trained on the dataset and strong for a model of its size, although some task-specific limitations remain. The work offers a practical approach to production quality assurance in long-context scenarios and suggests longer-context encoders (e.g., ModernBERT) as a future enhancement to further improve grounding and reduce hallucinations.
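The decompose, retrieve, score, and aggregate steps described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the sentence splitter, the word-overlap retriever, the `toy_entailment` scorer (a stand-in for a real NLI model), and the choice of `min` as the aggregation rule are all assumptions made for the example, and every function name is hypothetical.

```python
import re
from typing import Callable, List, Set


def split_into_claims(response: str) -> List[str]:
    """Naive sentence splitter; a stand-in for a real claim chunker (assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set used by the toy retriever and scorer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(claim: str, chunks: List[str], k: int = 1) -> List[str]:
    """Toy retriever: rank context chunks by word overlap with the claim."""
    claim_tokens = tokens(claim)
    ranked = sorted(chunks, key=lambda c: len(claim_tokens & tokens(c)), reverse=True)
    return ranked[:k]


def toy_entailment(premise: str, hypothesis: str) -> float:
    """Placeholder for p_ent: fraction of hypothesis words present in the premise.

    A real system would call an encoder-based NLI model here instead.
    """
    hyp = tokens(hypothesis)
    return len(hyp & tokens(premise)) / len(hyp) if hyp else 1.0


def grounding_score(
    response: str,
    chunks: List[str],
    p_ent: Callable[[str, str], float] = toy_entailment,
    k: int = 1,
) -> float:
    """Score a response by its weakest claim (min aggregation is an assumption)."""
    claims = split_into_claims(response)
    if not claims:
        return 1.0
    per_claim = [
        # Best support among the retrieved chunks; 0.0 if nothing was retrieved.
        max((p_ent(chunk, claim) for chunk in retrieve(claim, chunks, k)), default=0.0)
        for claim in claims
    ]
    return min(per_claim)
```

With a context of `["The Eiffel Tower is in Paris.", "It was completed in 1889."]`, a response that repeats both facts is fully supported under this toy scorer, while a response containing an unsupported sentence receives a low score because its weakest claim dominates. Taking the minimum reflects the idea that a single unsupported claim makes the whole response suspect; other aggregation rules are equally plausible.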

Abstract

Despite advancements in grounded content generation, production applications based on Large Language Models (LLMs) still suffer from hallucinated answers. We present "Grounded in Context", a member of Deepchecks' ORION (Output Reasoning-based InspectiON) family of lightweight evaluation models. It is our framework for hallucination detection, designed for production-scale long-context data and tailored to diverse use cases, including summarization, data extraction, and RAG. Inspired by the RAG architecture, our method integrates retrieval and Natural Language Inference (NLI) models to predict factual consistency between premises and hypotheses using an encoder-based model with only a 512-token context window. Our framework identifies unsupported claims with an F1 score of 0.83 in RAGTruth's response-level classification task, matching methods that were trained on the dataset and outperforming all comparable frameworks that use similar-sized models.
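To make the 512-token constraint concrete, the sketch below shows one way a long context could be cut into overlapping windows so that each window, together with the claim, fits the encoder's budget. It is an illustration under stated assumptions, not the paper's chunker: whitespace-separated words stand in for subword tokens, `reserved=3` assumes a BERT-style pair encoding with three special tokens, and the half-window stride and the name `window_chunks` are hypothetical choices.

```python
from typing import List, Optional


def window_chunks(
    context: str,
    claim: str,
    max_tokens: int = 512,
    reserved: int = 3,
    stride: Optional[int] = None,
) -> List[str]:
    """Split `context` into overlapping windows that fit alongside `claim`.

    Words approximate tokens here (assumption); a real system would count
    tokens with the encoder's own tokenizer.
    """
    words = context.split()
    # Room left for the premise once the claim and special tokens are counted.
    budget = max_tokens - len(claim.split()) - reserved
    if budget <= 0:
        raise ValueError("claim leaves no room for context in the window")
    step = stride if stride is not None else max(1, budget // 2)
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + budget]))
        if start + budget >= len(words):
            break  # the last window already reaches the end of the context
    return chunks
```

Overlapping windows reduce the chance that the evidence for a claim is split across a chunk boundary, at the cost of scoring more claim–context pairs.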

Paper Structure

This paper contains 8 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Distribution of the number of tokens in a claim after LLM-generated sequences were split by our chunker without setting a maximum token size.