"The Data Says Otherwise" - Towards Automated Fact-checking and Communication of Data Claims
Yu Fu, Shunan Guo, Jane Hoffswell, Victor S. Bursztyn, Ryan Rossi, John Stasko
TL;DR
The paper tackles misinformation arising from data-driven claims by introducing Aletheia, an automated fact-checking prototype whose LLM-based backend maps textual data claims to data facts, retrieves relevant evidence from datasets, and presents the results as data tables or visualizations. It adapts a six-component framework (data claim detection, text-to-data mapping, data evidence retrieval, veracity determination, data evidence presentation, and end-user interaction) to data claims and implements a seven-step prompting pipeline to generate data fact specifications. The authors evaluate two core NLP tasks on a curated dataset of 400 claims spanning 10 data fact types. In a mixed-method user study with 20 participants, they find that visualizations generally reduce review time and increase user confidence compared with data tables, and they identify domain-specific design considerations. They offer four design recommendations for presenting data evidence and discuss limitations and future work, highlighting Aletheia’s potential to support data journalism and broader data-intensive communication while mitigating misinformation.
Abstract
Fact-checking data claims requires data evidence retrieval and analysis, which can become tedious and intractable when done manually. This work presents Aletheia, an automated fact-checking prototype designed to facilitate data claims verification and enhance data evidence communication. For verification, we utilize a pre-trained LLM to parse the semantics for evidence retrieval. To effectively communicate the data evidence, we design representations in two forms: data tables and visualizations, tailored to various data fact types. Additionally, we design interactions that showcase a real-world application of these techniques. We evaluate the performance of two core NLP tasks with a curated dataset comprising 400 data claims and compare the two representation forms regarding viewers' assessment time, confidence, and preference via a user study with 20 participants. The evaluation offers insights into the feasibility and bottlenecks of using LLMs for data fact-checking tasks, potential advantages and disadvantages of using visualizations over data tables, and design recommendations for presenting data evidence.
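To make the verification flow described above more concrete, the following is a minimal sketch of how a parsed claim might be checked against a dataset. It is not Aletheia's implementation: the `DataFactSpec` fields, the "value" fact type label, the verdict strings, the 5% tolerance, and the toy table are all assumptions made for illustration. The LLM parsing step is skipped entirely, and the specification is written by hand as if it were the parser's output.

```python
from dataclasses import dataclass

# Hypothetical data fact specification. The field names are illustrative
# assumptions, not the schema produced by Aletheia's prompting pipeline.
@dataclass
class DataFactSpec:
    fact_type: str           # label only in this sketch, e.g. "value"
    measure: str             # column that holds the number being claimed
    subject: dict            # column -> value filters that select the rows
    claimed_value: float     # number stated in the textual claim
    tolerance: float = 0.05  # relative tolerance; an arbitrary choice


def retrieve_evidence(rows, spec):
    """Return every row that matches all filters in spec.subject."""
    return [
        row for row in rows
        if all(row.get(col) == val for col, val in spec.subject.items())
    ]


def determine_veracity(rows, spec):
    """Compare the claimed value with the first matching row.

    Returns a (verdict, evidence_rows) pair so the evidence can later be
    shown to a reader as a table or a chart.
    """
    evidence = retrieve_evidence(rows, spec)
    if not evidence:
        return "no evidence", evidence
    observed = evidence[0][spec.measure]
    allowed_gap = spec.tolerance * abs(spec.claimed_value)
    close = abs(observed - spec.claimed_value) <= allowed_gap
    return ("supported" if close else "refuted"), evidence


# Toy dataset, invented for this example.
TABLE = [
    {"region": "North", "year": 2020, "sales": 120.0},
    {"region": "North", "year": 2021, "sales": 150.0},
    {"region": "South", "year": 2021, "sales": 90.0},
]

# Hand-written spec for the claim "North sales reached 150 in 2021".
claim = DataFactSpec(
    fact_type="value",
    measure="sales",
    subject={"region": "North", "year": 2021},
    claimed_value=150.0,
)

verdict, evidence = determine_veracity(TABLE, claim)
print(verdict, evidence)
```

Returning the matching rows alongside the verdict keeps the evidence available for the presentation step, whether it is rendered as a data table or as a visualization.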
