CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks

Maciej Besta; Lorenzo Paleari; Marcin Copik; Robert Gerstenberger; Ales Kubicek; Piotr Nyczyk; Patrick Iff; Eric Schreiber; Tanja Srindran; Tomasz Lehmann; Hubert Niewiadomski; Torsten Hoefler

TL;DR

CheckEmbed (CE) addresses the bottleneck of verifying open-ended LLM outputs by using whole-answer embeddings and stability-inspired aggregation. It requires no task-specific training or ground truth and provides fast, model- and modality-agnostic verification with interpretable heatmaps. Across WikiBio, RAGTruth, and legal-term extraction, CE achieves strong alignment with ground truth, robust hallucination detection, and superior scalability compared to baselines. The framework offers a practical, deployable way to verify AI outputs in real-world, multimodal settings.

Abstract

Large Language Models (LLMs) are transforming a wide range of domains, yet verifying their outputs remains a significant challenge, especially for complex open-ended tasks such as consolidation, summarization, and knowledge extraction. To address this, we introduce CheckEmbed (CE): a simple, scalable, and accurate verification method. CE reduces each LLM answer to a single embedding vector using powerful modern embedding models such as SFR-Embedding-Mistral. Prior methods such as BERTScore and SelfCheckGPT relied on weaker encoders like BERT, forcing them to operate at token or sentence granularity. In contrast, CE performs fast, semantically rich comparisons directly at the whole-answer level, overcoming key limitations in both accuracy and scalability. We conduct a comprehensive design and time complexity analysis across 13 verification baselines, including classical text scorers (e.g., BLEU), stability-based methods (e.g., SelfCheckGPT), and generative evaluators (e.g., LLM-as-a-Judge), which highlights the effectiveness, efficiency, versatility, and simplicity of CE. Empirical results show that CE reliably detects hallucinations in both closed and open-ended tasks. We further present evidence that CE generalizes beyond text to other modalities such as vision, establishing it as a practical and versatile verification framework.
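The whole-answer comparison described in the abstract can be illustrated with a minimal sketch. It assumes each sampled LLM answer has already been reduced to one embedding vector by some embedding model; the toy 3-dimensional vectors below are hypothetical stand-ins for real embeddings. The function names and the mean-pairwise aggregation are illustrative assumptions, not the paper's confirmed implementation.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities; the kind of matrix a heatmap would show."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


def stability_score(embeddings):
    """Mean similarity over all distinct answer pairs (assumed aggregation)."""
    pairs = list(combinations(range(len(embeddings)), 2))
    if not pairs:
        raise ValueError("need at least two answers to compare")
    total = sum(cosine_similarity(embeddings[i], embeddings[j]) for i, j in pairs)
    return total / len(pairs)


# Hypothetical toy embeddings: one vector per sampled answer.
consistent_answers = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.95, 0.05, 0.0]]
divergent_answers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(stability_score(consistent_answers))  # close to 1: answers agree
print(stability_score(divergent_answers))   # 0.0: answers are unrelated
```

Under these assumptions, a low score signals that the sampled answers disagree with one another, which is the stability-style cue for a possible hallucination. Each answer costs one embedding call regardless of its length, so the comparison stays cheap.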

Paper Structure (29 sections, 20 figures, 7 tables)

Introduction
The CheckEmbed Design
Complexity Analysis
Evaluation
Analysis of Distinguishing Similar and Different Text Passages
Analysis of LLM Answer Verification
Beyond LLMs To Other Modalities
Analysis of Scalability
Analysis of Varying the Number of Sampled LLM Answers
Related Work
Conclusion
Details on Complexity Analysis
Computing Similarity of Two Passages
Verifying an Open-Ended Task Answer
Specification of Prompts
...and 14 more sections

Figures (20)

Figure 1: Overview of the CheckEmbed pipeline (left) and comparison between BERTScore, SelfCheckGPT, and CheckEmbed (right).
Figure 2: Advantages of CE in distinguishing similar and different LLM replies. We vary the used embedding model (for CE) and the used generative model (for LLM-as-a-Judge).
Figure 3: Analysis of the verification of LLM answers. We compare to BERTScore; SelfCheckGPT (with BERT) comes with significantly higher runtimes (detailed in the paper's section on runtimes) and less competitive scores, as it does not focus on open-ended answer-level analysis. Each heatmap shows the CE or BERTScore cosine similarity between all pairs of LLM replies, and between each reply and the ground truth (GT) prepared by human experts. Rows correspond to two representative legal documents for which the LLM shows, respectively, high and low confidence in its replies. Embedding model used: GPT Text Embedding Large. Generative model used: GPT-4o.
Figure 4: Analysis of fine-grained hallucination verification of LLM answers (GPT-4o) when summarizing legal documents.
Figure 5: Hallucination detection in vision models with CE. We score response quality of a vision model without real-world references or auxiliary models. Both CheckEmbed scores and correct image counts are normalized. As more items are requested, hallucinations rise, reducing correctness and CheckEmbed scores.
...and 15 more figures
