MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, Lixian Lai

TL;DR

We introduce MedFact, a Chinese medical fact-checking benchmark of 2,116 expert-annotated texts spanning 13 specialties, built with a hybrid AI-human pipeline designed to keep the data realistic, difficult, and uncontaminated. We benchmark 20 LLMs on veracity classification and error localization and observe a persistent gap to human performance, along with a notable over-criticism phenomenon when models use advanced reasoning strategies such as retrieval-augmented generation and multi-agent collaboration. External grounding improves detection but often harms precision, and error localization remains challenging because many models have only a shallow understanding of medicine. MedFact provides datasets and methodological insights to guide the development of factually reliable medical AI, and discusses practical considerations around data use, fairness, and deployment in healthcare contexts.

Abstract

Deploying Large Language Models (LLMs) in medical applications requires fact-checking capabilities to ensure patient safety and regulatory compliance. We introduce MedFact, a challenging Chinese medical fact-checking benchmark with 2,116 expert-annotated instances drawn from diverse real-world texts, spanning 13 specialties, 8 error types, 4 writing styles, and 5 difficulty levels. Construction uses a hybrid AI-human framework in which iterative expert feedback refines AI-driven, multi-criteria filtering to ensure high quality and difficulty. We evaluate 20 leading LLMs on veracity classification and error localization. The results show that models can often determine whether a text contains an error but struggle to localize it precisely, and even the top performers fall short of human performance. Our analysis reveals the "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which can be exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. MedFact highlights the challenges of deploying medical LLMs and provides resources for developing factually reliable medical AI systems.

Paper Structure

This paper contains 40 sections, 48 figures, and 14 tables.

Figures (48)

  • Figure 1: An overview of the data construction pipeline and the components of MedFact.
  • Figure 2: An example of fact-checking performed by Claude 3.7 Sonnet.
  • Figure 3: Distribution of the specialties in MedFact.
  • Figure 4: Error distribution of DeepSeek-R1 and XiaoYi on the EL task (zero-shot).
  • Figure 5: Zero-shot performance of different models on MedFact across different writing styles.
  • ...and 43 more figures