OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs

Hasan Iqbal, Yuxia Wang, Minghan Wang, Georgi Georgiev, Jiahui Geng, Iryna Gurevych, Preslav Nakov

TL;DR

OpenFactCheck addresses the factuality of open-domain LLM outputs with a unified, modular evaluation framework. It combines three modules: ResponseEval, for building customizable claim-verification pipelines; LLMEval, for assessing LLM factuality from multiple angles using a consolidated FactQA benchmark; and CheckerEval, for evaluating automatic fact-checkers via a public leaderboard. The framework integrates existing claim processors, retrievers, and verifiers, and offers Python and web interfaces to support adoption and fair benchmarking across studies. By unifying evaluation datasets, metrics, and tooling, OpenFactCheck aims to standardize comparisons, drive improvements in factuality, and support safer deployment of LLMs in high-stakes contexts. The work outlines future enhancements, including multilingual support, expanded datasets, and consideration of temporal dynamics and transparency in factuality evaluation.

Abstract

The increased use of large language models (LLMs) across a variety of real-world applications calls for automatic tools to check the factual accuracy of their outputs, as LLMs often hallucinate. This is difficult as it requires assessing the factuality of free-form open-domain responses. While there has been a lot of research on this topic, different papers use different evaluation benchmarks and measures, which makes them hard to compare and hampers future progress. To mitigate these issues, we developed OpenFactCheck, a unified framework, with three modules: (i) RESPONSEEVAL, which allows users to easily customize an automatic fact-checking system and to assess the factuality of all claims in an input document using that system, (ii) LLMEVAL, which assesses the overall factuality of an LLM, and (iii) CHECKEREVAL, a module to evaluate automatic fact-checking systems. OpenFactCheck is open-sourced (https://github.com/mbzuai-nlp/openfactcheck) and publicly released as a Python library (https://pypi.org/project/openfactcheck/) and also as a web service (http://app.openfactcheck.com). A video describing the system is available at https://youtu.be/-i9VKL0HleI.

Paper Structure

This paper contains 27 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Overview of the OpenFactCheck demo system for LLM factuality evaluation and its modules. Green (ResponseEval): a customized fact-checker that identifies factual errors in text inputs. Orange (LLMEval): an LLM factuality evaluator that assesses an LLM's factual ability from different aspects and produces a report illustrating its weaknesses and strengths. Purple (CheckerEval): a fact-checker evaluator and leaderboard that encourages the development of advanced checkers in terms of performance, latency, and cost.
  • Figure 2: Usage examples of three major modules: ResponseEval, LLMEval and CheckerEval.
  • Figure 3: OpenFactCheck Dashboard: (a) the navigation bar; (b) a claim processor breaking down the input into two atomic claims: the retriever collected 16 pieces of evidence, and the verifier assessed each claim individually, finding one true and one false, resulting in 50% credibility overall; (c) the user information required before uploading LLM responses or verification results to LLMEval and CheckerEval; (d) the download and upload functions; (e) and (f) the LLM and FactChecker evaluation reports, respectively.
  • Figure 4: Pseudo code for classes in ResponseEval.
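
The three-stage flow described in the Figure 3 caption (claim processor, then retriever, then verifier, followed by an overall credibility score) can be illustrated with a minimal Python sketch. Every name here (`Verdict`, `run_pipeline`, `credibility`, and the toy components) is hypothetical and is not the API of the `openfactcheck` package; the toy components stand in for real claim processors, retrievers, and verifiers. Computing credibility as the fraction of claims judged true is an assumption, chosen because it is consistent with the caption's example of one true claim out of two giving 50%.

```python
from dataclasses import dataclass
from typing import Callable, List

# NOTE: all names in this sketch are hypothetical illustrations,
# not the openfactcheck package API.


@dataclass
class Verdict:
    claim: str
    evidence: List[str]
    is_true: bool


# Each stage is a plain callable so that any implementation can be swapped in.
ClaimProcessor = Callable[[str], List[str]]
Retriever = Callable[[str], List[str]]
Verifier = Callable[[str, List[str]], bool]


def run_pipeline(text: str, process: ClaimProcessor,
                 retrieve: Retriever, verify: Verifier) -> List[Verdict]:
    """Split text into claims, gather evidence per claim, verify each claim."""
    verdicts = []
    for claim in process(text):
        evidence = retrieve(claim)
        verdicts.append(Verdict(claim, evidence, verify(claim, evidence)))
    return verdicts


def credibility(verdicts: List[Verdict]) -> float:
    """Assumed definition: fraction of claims judged true (0.0 if no claims)."""
    if not verdicts:
        return 0.0
    return sum(v.is_true for v in verdicts) / len(verdicts)


# --- Toy components, for illustration only ---
KNOWLEDGE = [
    "Paris is the capital of France",
    "Madrid is the capital of Spain",
]


def split_sentences(text: str) -> List[str]:
    # Treat each sentence as one atomic claim.
    return [s.strip() for s in text.split(".") if s.strip()]


def lookup(claim: str) -> List[str]:
    # Return stored facts sharing at least one capitalized word with the claim.
    names = {w for w in claim.split() if w[0].isupper()}
    return [fact for fact in KNOWLEDGE if names & set(fact.split())]


def exact_match(claim: str, evidence: List[str]) -> bool:
    # A claim counts as true only if it appears verbatim in the evidence.
    return claim in evidence


text = "Paris is the capital of France. Berlin is the capital of Spain."
verdicts = run_pipeline(text, split_sentences, lookup, exact_match)
for v in verdicts:
    print(v.claim, "->", v.is_true)
print(f"credibility: {credibility(verdicts):.0%}")
```

Passing the stages as callables mirrors the modular design described above: replacing the toy `lookup` with a web-search retriever, or `exact_match` with an LLM-based verifier, would not require changes to `run_pipeline`.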