Beyond speculation: Measuring the growing presence of LLM-generated texts in multilingual disinformation
Dominik Macko, Aashish Anantha Ramakrishnan, Jason Samuel Lucas, Robert Moro, Ivan Srba, Adaku Uchendu, Dongwon Lee
TL;DR
The paper quantifies how often LLM-generated texts appear in real-world multilingual disinformation. It introduces two robust multilingual detectors (Gemma_GenAI and Gemma_MultiDomain), trained with QLoRA and evaluated on diverse benchmarks, and then applies them to real-world datasets, estimating prevalence with a combined confident-detection approach. The findings show a detectable rise in LLM-generated content after accessible chat-based LLMs were introduced, with substantial variation across languages and platforms (e.g., Polish and French show higher relative prevalence, and Telegram and Instagram show notable levels). The study provides concrete empirical evidence supporting concerns about AI-assisted disinformation and emphasizes the need for continued detection improvements and credibility indicators to safeguard information integrity across multilingual online ecosystems.
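One way to picture a combined confident-detection estimate is sketched below. This is only an illustrative sketch, not the authors' implementation: it assumes each detector outputs a probability that a text is machine-generated, that a text counts as detected only when both detectors exceed a high threshold, and that prevalence is the flagged share of all texts. The names `Scores`, `is_confident_detection`, and `estimate_prevalence`, the agreement rule, and the threshold value of 0.9 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Scores:
    """Per-text detector outputs (hypothetical structure, not from the paper)."""
    gen_ai: float        # assumed probability from a Gemma_GenAI-style detector
    multi_domain: float  # assumed probability from a Gemma_MultiDomain-style detector


def is_confident_detection(s: Scores, threshold: float = 0.9) -> bool:
    # Assumed rule: flag a text only when BOTH detectors are confident.
    return s.gen_ai >= threshold and s.multi_domain >= threshold


def estimate_prevalence(scores: Sequence[Scores], threshold: float = 0.9) -> float:
    # Share of texts flagged by the combined rule; 0.0 for an empty dataset.
    if not scores:
        return 0.0
    flagged = sum(is_confident_detection(s, threshold) for s in scores)
    return flagged / len(scores)
```

Requiring agreement between the two detectors trades recall for precision, so an estimate of this kind is better read as a conservative lower bound than as an exact rate.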
Abstract
Increased sophistication of large language models (LLMs) and the consequent quality of generated multilingual text raise concerns about potential disinformation misuse. While humans struggle to distinguish LLM-generated content from human-written texts, the scholarly debate about their impact remains divided. Some argue that heightened fears are overblown due to natural ecosystem limitations, while others contend that specific "longtail" contexts face overlooked risks. Our study bridges this debate by providing the first empirical evidence of LLM presence in the latest real-world disinformation datasets, documenting the increase of machine-generated content following ChatGPT's release, and revealing crucial patterns across languages, platforms, and time periods.
