GenAI vs. Human Fact-Checkers: Accurate Ratings, Flawed Rationales
Yuehong Cassandra Tai, Khushi Navin Patni, Nicholas Daniel Hemauer, Bruce Desmarais, Yu-Ru Lin
TL;DR
This study investigates the ability of Generative AI (GenAI) to rate the credibility of online content and to explain the reasoning behind such judgments. Using zero-shot prompts, the authors evaluate GPT-4o, Llama 3.1-8b/70b, Gemma2-9b, and Flan-T5-XL on 415 accessible posts from low-credibility domains shared by U.S. state legislators on Facebook. They compare model outputs with the ratings of human coders, for which intercoder reliability is reported. They examine both full and summarized content, reporting that GPT-4o generally achieves the best accuracy (macro F1 ≈ 0.82; MCC ≈ 0.66) while agreement with human coders remains moderate, and that summarization can improve efficiency for some models (notably Gemma2-9b). A qualitative analysis of GPT-4o explanations reveals eleven reasoning patterns across three categories (content quality, language objectivity, legislative behaviors), with a heavy reliance on “hard” criteria such as detail, sources, and formality. This raises concerns about causal hallucination and underscores the value of human-in-the-loop approaches for misinformation detection in political contexts.
Abstract
Despite recent advances in understanding the capabilities and limits of generative artificial intelligence (GenAI) models, we are just beginning to understand their capacity to assess and reason about the veracity of content. We evaluate multiple GenAI models across tasks that involve rating the credibility of information and giving perceived reasoning about it. The information in our experiments comes from content that subnational U.S. politicians post to Facebook. We find that GPT-4o, one of the most widely used AI models in consumer applications, outperforms other models, but all models exhibit only moderate agreement with human coders. Importantly, even when GenAI models accurately identify low-credibility content, their reasoning relies heavily on linguistic features and “hard” criteria, such as the level of detail, source reliability, and language formality, rather than an understanding of veracity. We also assess the effectiveness of summarized versus full content inputs, finding that summarized content holds promise for improving efficiency without sacrificing accuracy. While GenAI has the potential to support human fact-checkers in scaling misinformation detection, our results caution against relying solely on these models.
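For readers unfamiliar with the two agreement metrics named in the summary (macro F1 and the Matthews correlation coefficient, MCC), the following is a minimal sketch of how they could be computed for a binary credibility label. The binary coding (1 = low credibility, 0 = otherwise), the function names, and the toy labels are assumptions made for illustration; they are not taken from the study's data or code.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (binary case)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    f1_pos = f1_from_counts(tp, fp, fn)
    # For the negative class, TN plays the role of TP and FP/FN swap.
    f1_neg = f1_from_counts(tn, fn, fp)
    return (f1_pos + f1_neg) / 2


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when any margin is empty."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy labels: human coder ratings vs. model ratings.
human = [1, 1, 1, 1, 0, 0, 0, 0]
model = [1, 1, 1, 0, 0, 0, 0, 1]
print(macro_f1(human, model), mcc(human, model))
```

Macro F1 weights both classes equally regardless of their frequency, while MCC uses all four cells of the confusion matrix, which is why the two are often reported together when classes may be imbalanced.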
