The Perils & Promises of Fact-checking with Large Language Models

Dorian Quelle, Alexandre Bovet

TL;DR

This study evaluates large language model (LLM) agents for automated fact-checking by having them generate queries, retrieve contextual data, and justify verdicts with cited sources using a ReAct-inspired framework. It compares GPT-3.5 and GPT-4 on the PolitiFact dataset and a large multilingual Data Commons corpus, under conditions with and without external context. Key findings show that contextual information enhances accuracy and calibration, GPT-4 outperforms GPT-3.5, and English translations improve performance on non-English claims, though results vary by language and category. The work highlights the potential and limitations of LLM-based fact-checking, emphasizing cautious deployment alongside human oversight and proposing future work on multilingual robustness and explainable reasoning.

Abstract

Automated fact-checking, using machine learning to verify claims, has grown vital as misinformation spreads beyond human fact-checking capacity. Large Language Models (LLMs) like GPT-4 are increasingly trusted to write academic papers, lawsuits, and news articles and to verify information, emphasizing their role in discerning truth from falsehood and the importance of being able to verify their outputs. Understanding the capacities and limitations of LLMs in fact-checking tasks is therefore essential for ensuring the health of our information ecosystem. Here, we evaluate the use of LLM agents in fact-checking by having them phrase queries, retrieve contextual data, and make decisions. Importantly, in our framework, agents explain their reasoning and cite the relevant sources from the retrieved context. Our results show the enhanced prowess of LLMs when equipped with contextual information. GPT-4 outperforms GPT-3.5, but accuracy varies based on query language and claim veracity. While LLMs show promise in fact-checking, caution is essential due to inconsistent accuracy. Our investigation calls for further research, fostering a deeper comprehension of when agents succeed and when they fail.
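
The loop described above (phrase queries, retrieve context, decide, cite sources) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `fact_check`, `propose_query`, `search`, and `decide` are hypothetical, and the LLM and search engine are passed in as plain callables so the example runs without any external service. The demo stubs return placeholder data only.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Evidence:
    url: str
    snippet: str


@dataclass
class Verdict:
    label: str
    explanation: str
    citations: List[str] = field(default_factory=list)


def fact_check(
    claim: str,
    propose_query: Callable[[str, List[Evidence]], Optional[str]],
    search: Callable[[str], List[Evidence]],
    decide: Callable[[str, List[Evidence]], Verdict],
    max_steps: int = 3,
) -> Verdict:
    """Hypothetical agent loop: query, retrieve, then decide with citations."""
    evidence: List[Evidence] = []
    for _ in range(max_steps):
        query = propose_query(claim, evidence)
        if query is None:  # the agent judges that it has enough context
            break
        evidence.extend(search(query))
    verdict = decide(claim, evidence)
    # Keep only citations that point at sources that were actually retrieved.
    retrieved = {e.url for e in evidence}
    verdict.citations = [u for u in verdict.citations if u in retrieved]
    return verdict


# Placeholder stubs standing in for an LLM and a search engine (assumptions).
def demo_propose_query(claim: str, evidence: List[Evidence]) -> Optional[str]:
    return None if evidence else f"fact check: {claim}"


def demo_search(query: str) -> List[Evidence]:
    return [Evidence(url="https://example.org/report", snippet="placeholder")]


def demo_decide(claim: str, evidence: List[Evidence]) -> Verdict:
    cited = [e.url for e in evidence] + ["https://example.org/unretrieved"]
    return Verdict(label="half-true", explanation="placeholder", citations=cited)


result = fact_check("An example claim.", demo_propose_query, demo_search, demo_decide)
print(result.label, result.citations)
```

Passing a `search` callable that always returns an empty list gives a rough analogue of a no-context condition: the verdict then rests only on what the `decide` callable already knows, and no citation survives the filter.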

Paper Structure (10 sections, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Workflow showing how we enable LLM agents to interact with a context to assess the veracity of a claim (top). Example of the treatment of a specific claim (bottom)
  • Figure 2: Examples of PolitiFact statements for which the model returns correct responses. The LLM is tasked with verifying a statement made by Donald Trump, indicating that Ted Cruz is mathematically out of the race. The LLM uses Google to retrieve information on the delegate count and correctly concludes the statement is mostly true. We show the Google queries performed by the LLM and the first results of each query. In the second example, the LLM is tasked with verifying a statement claiming Donald Trump's driver "did burnouts" during a race. The LLM finds information that Donald Trump did a lap around the race but correctly concludes that no information indicates that he did "burnouts". The full examples, including all Google results, are shown in the Supplementary Material.
  • Figure 3: Examples of PolitiFact statements for which the model returns incorrect responses. We show the Google queries performed by the LLM and the first results of each query. The LLM is asked to verify whether the Obama administration paid a ransom payment to Iran. The LLM finds information on the payment but cannot conclusively confirm the purpose of the payment. It concludes that the statement is half-true. PolitiFact argues that the statement is mostly-false, as the payment is not necessarily a ransom payment. In the second example, the LLM is asked to verify whether a beer brand is American. It finds information indicating that the company is American and returns False. The company has, however, been bought by foreign investors, making the statement true. The full examples, including all Google results, are shown in the Supplementary Material.
  • Figure 4: Number of fact-checks per month in the Data Commons & PolitiFact datasets. The blue (dashed) line shows the number of fact-checks in the PolitiFact dataset. The orange (solid) line shows the number of fact-checks in the Data Commons dataset.
  • Figure 5: Accuracy of GPT-3.5 & GPT-4 over time on the PolitiFact dataset, shown as a yearly rolling average. Panel (A) displays the accuracy of GPT-4. Panel (B) shows the accuracy of GPT-3.5. The blue line indicates the context condition, and the orange line indicates the no-context condition. The vertical line represents the training end date of both models according to OpenAI. The faint line is the three-month average. The x-axis represents the date of the claim. The bands represent one standard error.
  • ...and 2 more figures
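
The Figure 5 caption mentions a yearly rolling average of accuracy with bands of one standard error. The sketch below shows one way such a series could be computed. It is an assumption, not the paper's procedure: it uses a trailing 365-day window and the binomial standard error sqrt(p(1-p)/n), and the function name `rolling_accuracy` and the sample records are hypothetical.

```python
from datetime import date, timedelta
from math import sqrt
from typing import List, Tuple


def rolling_accuracy(
    records: List[Tuple[date, bool]], window_days: int = 365
) -> List[Tuple[date, float, float, int]]:
    """Return (date, accuracy, standard_error, n) for each distinct claim date.

    Each window covers the `window_days` days ending on, and including,
    the claim date. The standard error is the binomial one (an assumption).
    """
    series = []
    for end in sorted({d for d, _ in records}):
        start = end - timedelta(days=window_days)
        hits = [ok for d, ok in records if start < d <= end]
        n = len(hits)
        acc = sum(hits) / n
        se = sqrt(acc * (1 - acc) / n)
        series.append((end, acc, se, n))
    return series


# Made-up records: (date of the claim, whether the model's verdict was correct).
records = [
    (date(2020, 1, 1), True),
    (date(2020, 6, 1), False),
    (date(2021, 3, 1), True),
    (date(2021, 3, 1), True),
]
series = rolling_accuracy(records)
for day, acc, se, n in series:
    print(day, round(acc, 3), round(se, 3), n)
```

Setting `window_days` to roughly 90 would give an analogue of the fainter three-month line mentioned in the caption.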