Annotation Tool and Dataset for Fact-Checking Podcasts
Vinay Setty, Adam James Becker
TL;DR
This work tackles misinformation in long-form, multilingual podcasts by introducing an open-source tool for real-time transcription and annotation during playback, which preserves spoken context while supporting check-worthiness labeling and claim-span identification. The authors build an end-to-end pipeline that uses Whisper for transcription, pyannote for speaker diarization, spaCy for sentence segmentation, and F-Coref for co-reference resolution, and they combine it with crowdsourced annotations to create a multilingual dataset. They fine-tune XLM-RoBERTa-Large for claim detection and stance classification and compare it with GPT-4 in few-shot settings, demonstrating the viability of smaller transformers for fact-checking tasks. The released transcripts and annotations support end-to-end fact-checking experiments and give researchers and practitioners a practical resource for developing multilingual fact-checking models for podcasts.
Abstract
Podcasts are a popular medium on the web, featuring diverse and multilingual content that often includes unverified claims. Fact-checking podcasts is challenging: it requires transcription, annotation, and claim verification, all while preserving the contextual details of spoken content. Our tool addresses these challenges by enabling real-time annotation of podcasts during playback. Users can listen to a podcast and, at the same time, annotate key elements such as check-worthy claims, claim spans, and contextual errors. By integrating advanced transcription models like OpenAI's Whisper and leveraging crowdsourced annotations, we create high-quality datasets to fine-tune multilingual transformer models such as XLM-RoBERTa for tasks like claim detection and stance classification. Furthermore, we release the annotated podcast transcripts and sample annotations with preliminary experiments.
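
To make the transcription and diarization steps concrete, the following is a minimal sketch of one common way to attach speaker labels to transcript segments: each segment takes the speaker whose diarization turn overlaps it the most in time. The maximum-overlap rule, the `Segment` and `Turn` types, and the `assign_speakers` function are hypothetical and are not taken from the tool itself; the hard-coded inputs only stand in for what a transcription model such as Whisper and a diarization model would produce.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """A transcribed span of audio (times in seconds). Hypothetical type."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """A diarization turn: one speaker talking over an interval. Hypothetical type."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    segments: List[Segment], turns: List[Turn]
) -> List[Tuple[Optional[str], str]]:
    """Label each segment with the speaker of the most-overlapping turn.

    Assumption: maximum temporal overlap decides the speaker. Segments that
    overlap no turn at all are labeled None.
    """
    labeled = []
    for seg in segments:
        best_speaker: Optional[str] = None
        best_overlap = 0.0
        for turn in turns:
            ov = overlap(seg.start, seg.end, turn.start, turn.end)
            if ov > best_overlap:
                best_overlap = ov
                best_speaker = turn.speaker
        labeled.append((best_speaker, seg.text))
    return labeled


# Stand-in data; a real run would take these from the transcription and
# diarization models rather than hard-coding them.
segments = [
    Segment(0.0, 4.0, "Welcome to the show."),
    Segment(4.0, 9.0, "Thanks for having me."),
    Segment(20.0, 22.0, "Unattributed remark."),
]
turns = [
    Turn(0.0, 4.5, "SPEAKER_00"),
    Turn(4.5, 10.0, "SPEAKER_01"),
]
result = assign_speakers(segments, turns)
```

Speaker-attributed segments of this kind are what an annotator would see alongside playback; sentence segmentation and co-reference resolution would then operate on the attributed text.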
