FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models

Hongzhan Lin, Yang Deng, Yuxuan Gu, Wenxuan Zhang, Jing Ma, See-Kiong Ng, Tat-Seng Chua

TL;DR

FACT-AUDIT introduces an adaptive multi-agent framework for dynamic fact-checking evaluation of large language models, integrating importance sampling with a taxonomy-driven data generation pipeline to reveal model-specific limitations in factual verification and justification. The framework leverages Appraiser, Inquirer, Quality Inspector, Evaluator, and Prober agents to build and iteratively refine test scenarios, while tracking performance via metrics like Insight Mastery Rate ($IMR$), Justification Flaw Rate ($JFR$), and Grade. It demonstrates model-centric auditing across 13 LLMs, uncovering performance gaps between open-source and proprietary models and across test modes, with justification quality providing complementary insights to verdict accuracy. The results support dynamic updating of test data and taxonomy as a scalable approach to robust, real-world fact-checking evaluation, offering practical implications for reliability, bias mitigation, and ongoing model improvement.

Abstract

Large Language Models (LLMs) have significantly advanced fact-checking research. However, existing automated fact-checking evaluation methods rely on static datasets and classification metrics, so they can neither automatically evaluate justification production nor uncover the nuanced limitations of LLMs in fact-checking. In this work, we introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities. Leveraging importance sampling principles and multi-agent collaboration, FACT-AUDIT generates adaptive and scalable datasets, performs iterative model-centric evaluations, and updates assessments based on model-specific responses. By incorporating justification production alongside verdict prediction, the framework provides a comprehensive and evolving audit of LLMs' factual reasoning capabilities as a way to investigate their trustworthiness. Extensive experiments demonstrate that FACT-AUDIT effectively differentiates among state-of-the-art LLMs, providing valuable insights into model strengths and limitations in model-centric fact-checking analysis.
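
The adaptive loop described in the abstract (sample test scenarios, evaluate the model, update based on its responses) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it swaps in simple failure-weighted sampling as a loose stand-in for the importance sampling principles the paper refers to, and the scenario names, the `floor` and `step` parameters, and all function names are hypothetical choices made only for illustration.

```python
import random


def sampling_weights(failure_rates, floor=0.05):
    """Normalize per-scenario failure rates into a sampling distribution.

    `floor` (an assumed parameter) keeps every scenario reachable, so a
    scenario the model currently handles well is still re-tested sometimes.
    """
    raw = {name: max(rate, floor) for name, rate in failure_rates.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


def pick_scenarios(failure_rates, k, seed=0):
    """Draw k scenarios, favoring those where the audited model fails more."""
    rng = random.Random(seed)
    weights = sampling_weights(failure_rates)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


def update_failure_rate(old_rate, failed, step=0.3):
    """Exponential moving average of observed failures (assumed update rule)."""
    return (1 - step) * old_rate + step * (1.0 if failed else 0.0)


# Hypothetical scenario names and starting failure estimates.
rates = {"scenario_a": 0.6, "scenario_b": 0.2, "scenario_c": 0.0}
batch = pick_scenarios(rates, k=5)
rates["scenario_a"] = update_failure_rate(rates["scenario_a"], failed=True)
```

In a sketch like this, each round shifts the test budget toward scenarios where the audited model is weakest, which is the general intuition behind model-centric, dynamically updated evaluation; the actual agents, prompts, and update rules are those defined in the paper itself.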

Paper Structure

This paper contains 39 sections, 7 equations, 19 figures, 8 tables, 1 algorithm.

Figures (19)

  • Figure 1: The pipelines of the existing solution and the proposed FACT-AUDIT in fact-checking evaluation.
  • Figure 2: An overview of FACT-AUDIT, which adaptively unveils the fact-checking limitations of LLMs.
  • Figure 3: The taxonomy of fact-checking scenarios.
  • Figure 4: IMR of the two most challenging test scenarios for each fact-checking object, with the averaged IMR.
  • Figure 5: The effect of different iterations in scaling.
  • ...and 14 more figures