Tell Me Why: Explainable Public Health Fact-Checking with Large Language Models

Majid Zarharan; Pascal Wullschleger; Babak Behkam Kia; Mohammad Taher Pilehvar; Jennifer Foster

TL;DR

The paper studies explainable fact-checking in public health, evaluating a broad set of LLMs on veracity prediction, explanation generation, and the joint task using the PUBHEALTH dataset. It compares zero-shot prompting, few-shot prompting, and parameter-efficient fine-tuning (PEFT) across closed- and open-source models, using both automatic metrics and new human-evaluation guidelines. Under automatic evaluation, GPT-4 is the strongest model in the zero-shot setting, while in the few-shot and PEFT settings open-source models close the gap and in some cases surpass it. Human evaluation adds further nuance and points to potential problems with the gold explanations. The work also offers practical insights for deploying explainable health fact-checking systems and discusses limitations and avenues for future research.

Abstract

This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero-shot and few-shot prompting and of parameter-efficient fine-tuning across various open- and closed-source models, assessing their performance on both the isolated and the joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria for human evaluation. Our automatic evaluation indicates that, in the zero-shot scenario, GPT-4 is the standout performer, but in the few-shot and parameter-efficient fine-tuning settings, open-source models not only bridge the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals further nuance and indicates potential problems with the gold explanations.
