CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks

Maciej Besta, Lorenzo Paleari, Marcin Copik, Robert Gerstenberger, Ales Kubicek, Piotr Nyczyk, Patrick Iff, Eric Schreiber, Tanja Srindran, Tomasz Lehmann, Hubert Niewiadomski, Torsten Hoefler

TL;DR

CheckEmbed (CE) addresses the verification bottleneck for open-ended LLM outputs by using whole-answer embeddings and stability-inspired aggregation. It requires no task-specific training or ground truth and provides fast, model- and modality-agnostic verification with interpretable heatmaps. Across WikiBio, RAGTruth, and legal-term extraction, CE achieves strong alignment with ground truth, robust hallucination detection, and superior scalability compared to baselines. The framework offers a practical, deployable solution for reliable AI outputs in real-world, multimodal contexts.

Abstract

Large Language Models (LLMs) are transforming a wide range of domains, yet verifying their outputs remains a significant challenge, especially for complex open-ended tasks such as consolidation, summarization, and knowledge extraction. To address this, we introduce CheckEmbed (CE): a simple, scalable, and accurate verification method. CE reduces each LLM answer to a single embedding vector using powerful modern LLM-based embedding models like SFR-Embedding-Mistral. Prior methods such as BERTScore and SelfCheckGPT relied on weaker encoders like BERT, forcing them to operate at token or sentence granularity. In contrast, CE performs fast, semantically rich comparisons directly at the whole-answer level, overcoming key limitations in both accuracy and scalability. We conduct a comprehensive design and time complexity analysis across 13 verification baselines, including classical text scorers (e.g., BLEU), stability-based methods (e.g., SelfCheckGPT), and generative evaluators (e.g., LLM-as-a-Judge), which highlights the effectiveness, efficiency, versatility, and simplicity of CE. Empirical results show that CE reliably detects hallucinations in both closed and open-ended tasks. We further present evidence that CE generalizes beyond text to other modalities such as vision, establishing it as a practical and versatile verification framework.
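
The whole-answer comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model such as SFR-Embedding-Mistral, and taking the mean pairwise cosine similarity as the aggregate is an assumed, simple form of stability-style scoring. The function names are illustrative only.

```python
import math
from itertools import combinations


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a real embedding model (assumption):
    maps a whole answer to one vector via hashed bag-of-words counts."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # Deterministic bucket; Python's built-in hash() is salted per run.
        bucket = sum(ord(ch) * (i + 1) for i, ch in enumerate(word)) % dim
        vec[bucket] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarities; this is the data behind a heatmap."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def stability_score(matrix):
    """Mean off-diagonal similarity (assumed aggregation, see lead-in)."""
    pairs = list(combinations(range(len(matrix)), 2))
    if not pairs:
        raise ValueError("need at least two answers to compare")
    return sum(matrix[i][j] for i, j in pairs) / len(pairs)


def check_answers(answers, embed=toy_embed):
    """Embed each sampled answer once, then score their mutual agreement."""
    return stability_score(similarity_matrix([embed(a) for a in answers]))


if __name__ == "__main__":
    agreeing = [
        "Paris is the capital of France",
        "The capital of France is Paris",
    ]
    mixed = agreeing + ["Bananas grow in tropical climates"]
    print("agreeing answers:", round(check_answers(agreeing), 3))
    print("mixed answers:   ", round(check_answers(mixed), 3))
```

Each answer is embedded exactly once, so only the cheap vector comparisons grow quadratically with the number of sampled answers. A low aggregate score suggests the sampled answers disagree with each other, which the stability-based view treats as a hint of hallucination.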

Paper Structure (29 sections, 20 figures, 7 tables)

Figures (20)

  • Figure 1: Overview of the CheckEmbed pipeline (left) and comparison between BERTScore, SelfCheckGPT, and CheckEmbed (right).
  • Figure 2: Advantages of CE in distinguishing similar and different LLM replies. We vary the embedding model used (for CE) and the generative model used (for LLM-as-a-Judge).
  • Figure 3: Analysis of the verification of LLM answers. We compare to BERTScore; SelfCheckGPT (with BERT) has significantly higher runtimes (detailed in the paper's runtime section) and less competitive scores, as it does not focus on open-ended answer-level analysis. The results form a heatmap of the cosine similarity, computed with CE or BERTScore, between all LLM replies, and between each reply and the ground truth (GT) prepared by a human expert. Rows correspond to two representative legal documents for which the LLM has, respectively, high and low confidence in its replies. Embedding model used: GPT Text Embedding Large. Generative model used: GPT-4o.
  • Figure 4: Analysis of fine-grained hallucination verification of LLM answers (GPT-4o) when summarizing legal documents.
  • Figure 5: Hallucination detection in vision models with CE. We score the response quality of a vision model without real-world references or auxiliary models. Both CheckEmbed scores and correct image counts are normalized. As more items are requested, hallucinations rise, reducing both correctness and CheckEmbed scores.
  • ...and 15 more figures