Data-efficient Meta-models for Evaluation of Context-based Questions and Answers in LLMs

Julia Belikova, Konstantin Polev, Rauf Parchiev, Dmitry Simakov

TL;DR

Large language models and RAG systems suffer from hallucinations, and annotation-heavy supervised methods limit industrial adoption. The authors present a data-efficient framework that combines internal-state feature extraction, dimensionality reduction, and lightweight meta-classification (notably TabPFNv2) to detect contextual hallucinations with as few as 50–250 labeled examples. They demonstrate competitive ROC-AUC performance on RAGBench QA benchmarks, approaching or matching strong proprietary baselines while using open-source extractors and limited data. The work highlights the practical potential of tabular foundation models and lightweight pipelines for reliable, private, low-latency hallucination detection in industry contexts.
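As a rough illustration of the pipeline shape described above (features, then dimensionality reduction, then a lightweight meta-classifier scored by ROC-AUC), the following minimal sketch may help. It rests on several assumptions: the features are synthetic Gaussian vectors rather than real LLM internal states, PCA stands in for whichever reduction the authors use, and logistic regression stands in for TabPFNv2. The sizes, the injected signal, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for per-response internal-state features (assumption:
# real features would come from an open-source extractor LLM).
n_samples, n_features = 1000, 64
labels = rng.integers(0, 2, size=n_samples)  # 1 = hallucinated (hypothetical labeling)
features = rng.normal(size=(n_samples, n_features))
features[:, :8] += labels[:, None] * 1.0  # inject a class signal into 8 dimensions

# Small labeled budget, mirroring the 250-example setting mentioned above.
n_train = 250
X_train, y_train = features[:n_train], labels[:n_train]
X_test, y_test = features[n_train:], labels[n_train:]

# Dimensionality reduction followed by a lightweight classifier.
# LogisticRegression is a stand-in here, not the TabPFNv2 model named in the text.
meta_model = make_pipeline(
    PCA(n_components=16, random_state=0),
    LogisticRegression(max_iter=1000),
)
meta_model.fit(X_train, y_train)

# Score held-out responses by predicted hallucination probability.
test_auc = roc_auc_score(y_test, meta_model.predict_proba(X_test)[:, 1])
print(f"Test ROC-AUC with {n_train} training samples: {test_auc:.3f}")
```

Reducing the feature dimension before fitting keeps the number of learned parameters small relative to the labeled budget, which is the main reason such a pipeline can plausibly work with a few hundred examples.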

Abstract

Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly deployed in industry applications, yet their reliability remains hampered by challenges in detecting hallucinations. While supervised state-of-the-art (SOTA) methods that leverage LLM hidden states -- such as activation tracing and representation analysis -- show promise, their dependence on extensively annotated datasets limits scalability in real-world applications. This paper addresses the critical bottleneck of data annotation by investigating the feasibility of reducing training data requirements for two SOTA hallucination detection frameworks: Lookback Lens, which analyzes attention head dynamics, and probing-based approaches, which decode internal model representations. We propose a methodology combining efficient classification algorithms with dimensionality reduction techniques to minimize sample size demands while maintaining competitive performance. Evaluations on standardized question-answering RAG benchmarks show that our approach achieves performance comparable to strong proprietary LLM-based baselines with only 250 training samples. These results highlight the potential of lightweight, data-efficient paradigms for industrial deployment, particularly in annotation-constrained scenarios.
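To make "attention head dynamics" more concrete, here is a minimal sketch of a lookback-ratio style feature, assuming our reading of the Lookback Lens formulation: for one attention head at one decoding step, the mean attention on context tokens divided by the sum of that mean and the mean attention on newly generated tokens. The function names, the toy attention weights, and the handling of the no-new-tokens case are illustrative assumptions, not details taken from this paper.

```python
from statistics import fmean


def lookback_ratio(attn_weights, n_context):
    """Share of (mean) attention that one head puts on the context.

    attn_weights: attention weights of a single head at one decoding step,
                  ordered as [context tokens..., newly generated tokens...].
    n_context:    number of context tokens at the front of the sequence.
    """
    if not 0 < n_context <= len(attn_weights):
        raise ValueError("n_context must be in 1..len(attn_weights)")
    context_mean = fmean(attn_weights[:n_context])
    new_tokens = attn_weights[n_context:]
    # Assumption: with no generated tokens yet, their mean attention is 0.
    new_mean = fmean(new_tokens) if new_tokens else 0.0
    total = context_mean + new_mean
    return context_mean / total if total > 0 else 0.0


def lookback_features(per_head_weights, n_context):
    """One ratio per head; stacked across heads this forms a feature vector."""
    return [lookback_ratio(w, n_context) for w in per_head_weights]


# Toy example: two heads, three context tokens, two generated tokens.
heads = [
    [0.1, 0.1, 0.2, 0.3, 0.3],     # attends mostly to generated tokens
    [0.3, 0.3, 0.3, 0.05, 0.05],   # attends mostly to the context
]
ratios = lookback_features(heads, n_context=3)
print(ratios)
```

A vector of such ratios, collected over heads and averaged over a response span, is the kind of low-dimensional input that a small meta-classifier could consume.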

Paper Structure

This paper contains 7 sections, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Test ROC-AUC versus training-set size for the proposed evaluators (solid lines) across the three benchmarks (rows) and three response generators (columns). Horizontal dashed lines correspond to the zero-shot GPT-4o judge (yellow) and the RAGAS GPT-4o pipeline (cyan). Shaded areas indicate 95% confidence intervals over three random seeds.