Fake News Detection After LLM Laundering: Measurement and Explanation
Rupak Kumar Das, Jonathan Dodge
TL;DR
This work investigates fake-news detection when content is generated or paraphrased by large language models (LLMs). The study evaluates a broad suite of detectors on human-written and LLM-paraphrased news from the COVID-19 and LIAR datasets, using multiple paraphrasers (PEGASUS, GPT, Llama). Detectors struggle more with paraphrased content, especially text paraphrased by PEGASUS, while GPT-based paraphrases often preserve semantic similarity as measured by $F_{BERT}$. Explainability via LIME shows that sentiment shifts introduced during paraphrasing can drive misclassifications, highlighting a gap between measured semantic similarity and perceived sentiment. The authors contribute two paraphrase datasets, analyze detector robustness, and discuss the need for sentiment-aware evaluation metrics to improve detection in real-world misinformation scenarios.
Abstract
With their advanced capabilities, Large Language Models (LLMs) can generate highly convincing and contextually relevant fake news, which can contribute to the spread of misinformation. Although there is much research on fake news detection for human-written text, the detection of LLM-generated fake news is still under-explored. This research measures the efficacy of detectors in identifying LLM-paraphrased fake news, in particular determining whether adding a paraphrase step to the detection pipeline helps or impedes detection. This study makes the following contributions: (1) We show that detectors struggle more with LLM-paraphrased fake news than with human-written text. (2) We identify which models excel at which tasks (evading detection, paraphrasing to evade detection, and paraphrasing for semantic similarity). (3) Via LIME explanations, we identify a possible reason for detection failures: sentiment shift. (4) We uncover a worrisome trend for paraphrase quality measurement: samples that exhibit sentiment shift despite a high BERTScore. (5) We provide a pair of datasets that augment existing datasets with paraphrase outputs and scores. The datasets are available on GitHub.
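
As background for the semantic-similarity score referenced above, the following sketches the standard BERTScore definition in its usual greedy-matching form. It assumes L2-normalized contextual token embeddings, so that an inner product equals cosine similarity. The symbols $x$ (tokens of the original text) and $\hat{x}$ (tokens of the paraphrase) are notation introduced here for illustration, and the optional idf weighting and baseline rescaling are omitted; whether the paper uses either option is not stated in this section.

$$
R_{BERT} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top} \hat{x}_j,
\qquad
P_{BERT} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top} \hat{x}_j
$$

$$
F_{BERT} = \frac{2 \, P_{BERT} \, R_{BERT}}{P_{BERT} + R_{BERT}}
$$

Because each token is matched only to its most similar counterpart, a paraphrase that swaps a few sentiment-bearing words while keeping most other tokens can still score highly, which is consistent with the sentiment-shift observation in contribution (4).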
