FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models
Hongzhan Lin, Yang Deng, Yuxuan Gu, Wenxuan Zhang, Jing Ma, See-Kiong Ng, Tat-Seng Chua
TL;DR
FACT-AUDIT is an adaptive multi-agent framework for dynamic fact-checking evaluation of large language models. It combines importance sampling with a taxonomy-driven data generation pipeline to reveal model-specific limitations in factual verification and justification. Its Appraiser, Inquirer, Quality Inspector, Evaluator, and Prober agents build and iteratively refine test scenarios, and performance is tracked with metrics such as Insight Mastery Rate ($IMR$), Justification Flaw Rate ($JFR$), and Grade. A model-centric audit of 13 LLMs uncovers performance gaps between open-source and proprietary models and across test modes, and shows that justification quality provides insights complementary to verdict accuracy. The results support dynamically updating test data and taxonomy as a scalable approach to robust, real-world fact-checking evaluation, with practical implications for reliability, bias mitigation, and ongoing model improvement.
Abstract
Large Language Models (LLMs) have significantly advanced fact-checking research. However, existing automated fact-checking evaluation methods rely on static datasets and classification metrics, which cannot automatically evaluate justification production and fail to uncover the nuanced limitations of LLMs in fact-checking. In this work, we introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities. Leveraging importance sampling principles and multi-agent collaboration, FACT-AUDIT generates adaptive and scalable datasets, performs iterative model-centric evaluations, and updates assessments based on model-specific responses. By incorporating justification production alongside verdict prediction, the framework provides a comprehensive and evolving audit of LLMs' factual reasoning capabilities, and thereby of their trustworthiness. Extensive experiments demonstrate that FACT-AUDIT effectively differentiates among state-of-the-art LLMs, providing valuable insights into model strengths and limitations in model-centric fact-checking analysis.
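To make the idea of adaptive, model-centric test selection concrete, the following is a minimal sketch and not the FACT-AUDIT algorithm itself. It uses a simplified weakness-weighted sampling loop, loosely inspired by importance sampling: categories where the audited model scores poorly are drawn more often in later rounds. The category names, the 0 to 10 score scale, the `rate` parameter, and the functions `update_weights`, `sample_category`, and `audit` are all hypothetical and chosen only for illustration.

```python
import random


def update_weights(weights, category, score, max_score=10.0, rate=0.5):
    """Return renormalized weights after boosting a poorly scored category.

    Assumption: scores lie in [0, max_score]; lower means a worse response.
    """
    failure = 1.0 - score / max_score  # 0.0 = perfect, 1.0 = total failure
    updated = dict(weights)
    updated[category] = weights[category] * (1.0 + rate * failure)
    total = sum(updated.values())
    return {name: w / total for name, w in updated.items()}


def sample_category(weights, rng):
    """Draw one category with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def audit(categories, score_fn, rounds, seed=0):
    """Run a toy audit loop: sample a category, score it, update the weights.

    score_fn is a hypothetical stand-in for generating a test case in the
    category, querying the model, and grading its verdict and justification.
    """
    rng = random.Random(seed)
    weights = {c: 1.0 / len(categories) for c in categories}
    history = []
    for _ in range(rounds):
        category = sample_category(weights, rng)
        score = score_fn(category)
        history.append((category, score))
        weights = update_weights(weights, category, score)
    return weights, history


if __name__ == "__main__":
    # Hypothetical model that always fails on "category_b".
    def toy_score(category):
        return 0.0 if category == "category_b" else 10.0

    final_weights, _ = audit(
        ["category_a", "category_b", "category_c"], toy_score, rounds=50
    )
    print(final_weights)
```

In this sketch the sampling distribution drifts toward whichever category exposes the most failures, so later rounds spend more of the test budget on the model's weak spots. A faithful implementation would also need the data generation, quality inspection, and grading steps, which are abstracted here into `score_fn`.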
