MUCH: A Multilingual Claim Hallucination Benchmark

Jérémie Dentan, Alexi Canesse, Davide Buscaldi, Aymen Shabou, Sonia Vanier

TL;DR

MUCH is a multilingual benchmark for claim-level uncertainty quantification (UQ) of LLM outputs. It covers 4 languages and 4 open-weight models, and releases 24 logits per generated token to enable white-box UQ research. It introduces much_segmenter, a fast deterministic claim segmentation tool, and automates claim-level annotation with GPT-4o and GPT-4.1, using human-annotated subsets as a quality check. The dataset comprises 4,873 samples and 20,751 claims, accompanied by generation configurations and runtime statistics so that methods can be assessed against real-time deployment constraints. Evaluations of existing baselines show clear performance gaps and efficiency trade-offs, underscoring the need for stronger, language-robust, and computationally efficient claim-level UQ methods.

Abstract

Claim-level Uncertainty Quantification (UQ) is a promising approach to mitigate the lack of reliability in Large Language Models (LLMs). We introduce MUCH, the first claim-level UQ benchmark designed for fair and reproducible evaluation of future methods under realistic conditions. It includes 4,873 samples across four European languages (English, French, Spanish, and German) and four instruction-tuned open-weight LLMs. Unlike prior claim-level benchmarks, we release 24 generation logits per token, facilitating the development of future white-box methods without re-generating data. Moreover, in contrast to previous benchmarks that rely on manual or LLM-based segmentation, we propose a new deterministic algorithm capable of segmenting claims using as little as 0.2% of the LLM generation time. This makes our segmentation approach suitable for real-time monitoring of LLM outputs, ensuring that MUCH evaluates UQ methods under realistic deployment constraints. Finally, our evaluations show that current methods still have substantial room for improvement in both performance and efficiency.

Paper Structure

This paper contains 39 sections, 2 equations, 10 figures, 7 tables, and 1 algorithm.

Figures (10)

  • Figure 1: We open-source four artifacts as part of the MUCH benchmark: (1) 4,873 LLM generations spanning four languages (English, French, Spanish, and German) and four models (Llama 3.1 8B, Llama 3.2 3B, Ministral 8B, and Gemma 3 4B); (2) 24 logits per generation token; (3) $\texttt{much\_segmenter}$, a fast and reproducible claim segmenter; and (4) claim-level factuality annotations for every sample, totaling 20,751 binary annotations. This framework facilitates the evaluation of future methods, which only requires defining a new token-level score, aggregating it, and comparing it to the claim-level annotations.
  • Figure 2: Construction pipeline of the MUCH benchmark. We filter English, French, Spanish, and German questions from the Mu-SHROOM test set \cite{mushroom} (see [A]). We then generate eight LLM answers per question and retain 24 logits per generated token (see [B]). Next, we use $\texttt{much\_segmenter}$ to segment the LLM generations into claims (see [C]). We automatically assign two binary labels to each claim, one using GPT-4o and one using GPT-4.1 (see [D]). Finally, we retain only high-quality annotations by discarding every sample in which the GPT-4o and GPT-4.1 labels disagree on at least one claim (see [E]).
  • Figure 3: Confusion matrices comparing claim annotations from GPT-4o, GPT-4.1, and human annotators (an0, an1), before and after sample filtering (see [E] in \ref{fig:pipeline} and \ref{sec:methodology_filtering}). Cohen’s kappa ($\kappa$) quantifies inter-annotator agreement. \ref{fig:correlation_matrix_gpts}: GPT-4o vs GPT-4.1 on the 34.1k claims of the 6.4k samples before filtering. \ref{fig:correlation_human_before}: two human annotators on the 511 claims of 100 random samples before filtering. \ref{fig:correlation_an0}-\ref{fig:correlation_humans}: GPT vs an0 vs an1 on the 865 claims of 200 random samples after filtering.
  • Figure 4: Evaluation of baseline methods on MUCH. The state-of-the-art CCP method \cite{fadeeva_fact-checking_2024} outperforms other approaches, but there remains considerable room for improvement.
  • Figure 5: Annotation instructions
  • ...and 5 more figures
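
The evaluation loop described in the Figure 1 caption (define a token-level score, aggregate it per claim, compare it to the claim-level annotations) can be sketched as follows. This is a minimal illustration, not the benchmark's own code. Every function name, the data layout (a list of retained logits per token, half-open token spans per claim), the choice of negative log-probability as the token score, mean aggregation, ROC AUC as the comparison metric, and the label convention (1 = claim annotated as incorrect) are assumptions made for the example. The toy data keeps 3 logits per token instead of 24 for readability.

```python
import math


def token_scores(top_logits, chosen_index):
    """Negative log-probability of each generated token.

    The softmax is restricted to the retained logits, so this only
    approximates the full-vocabulary probability (assumption of this sketch).
    """
    scores = []
    for logits, idx in zip(top_logits, chosen_index):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        scores.append(log_z - logits[idx])
    return scores


def claim_scores(scores, claim_spans):
    """Mean token score per claim; spans are half-open (start, end) token ranges."""
    return [sum(scores[s:e]) / (e - s) for s, e in claim_spans]


def roc_auc(scores, labels):
    """Probability that a claim labeled 1 scores higher than a claim labeled 0.

    Ties count as one half. Raises if either class is missing.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute ROC AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy generation: 4 tokens, 3 retained logits each (hypothetical values).
toy_logits = [[5.0, 1.0, 0.0], [4.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 1.5, 1.0]]
toy_chosen = [0, 0, 1, 2]          # position of the generated token among the retained logits
toy_spans = [(0, 2), (2, 4)]       # two claims, two tokens each
toy_labels = [0, 1]                # assumed convention: 1 = claim annotated as incorrect

per_token = token_scores(toy_logits, toy_chosen)
per_claim = claim_scores(per_token, toy_spans)
auc = roc_auc(per_claim, toy_labels)
```

In this toy case the second claim is generated from flatter logit distributions, so it receives the higher uncertainty score and the ranking agrees with the labels. Swapping in a different token-level score or aggregation only requires replacing `token_scores` or `claim_scores`.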