ConspEmoLLM-v2: A robust and stable model to detect sentiment-transformed conspiracy theories
Zhiwei Liu, Paul Thompson, Jiaqi Rong, Sophia Ananiadou
TL;DR
Conspiracy theories can be generated and covertly spread by LLMs, especially when their sentiment is manipulated to evade detectors. The authors use GPT-4o to augment the ConDID dataset with sentiment-transformed rewrites, yielding ConDID-v2, and use EmoLLaMA to verify the sentiment shifts. They then train ConspEmoLLM-v2 on this data and compare it against multiple baselines. Results show that ConspEmoLLM-v2 preserves or improves performance on the original ConDID content and is robust to sentiment-based attacks on ConDID-v2, outperforming the baselines in the transformed-data scenario. The work provides a conspiracy-detection framework that withstands adversarial sentiment manipulation, with sentiment strength quantified by a score $S \in [0,1]$ that characterizes emotion intensity.
Abstract
Despite the many benefits of large language models (LLMs), they can also cause harm, e.g., through automatic generation of misinformation, including conspiracy theories. Moreover, LLMs can "disguise" conspiracy theories by altering characteristic textual features, e.g., by transforming their typically strong negative emotions into a more positive tone. Although several studies have proposed automated conspiracy theory detection methods, these are usually trained on human-authored text, whose features can differ from those of LLM-generated text. Furthermore, several conspiracy detection models, including the previously proposed ConspEmoLLM, rely heavily on the typical emotional features of human-authored conspiracy content. As such, intentionally disguised content may evade detection. To combat these issues, we first developed an augmented version of the ConDID conspiracy detection dataset, ConDID-v2, which supplements human-authored conspiracy tweets with versions rewritten by an LLM to reduce the negativity of their original sentiment. The quality of the rewritten tweets was verified by combining human and LLM-based assessment. We subsequently used ConDID-v2 to train ConspEmoLLM-v2, an enhanced version of ConspEmoLLM. Experimental results demonstrate that ConspEmoLLM-v2 retains or exceeds the performance of ConspEmoLLM on the original human-authored content in ConDID, and considerably outperforms both ConspEmoLLM and several other baselines when applied to sentiment-transformed tweets in ConDID-v2. The project will be available at https://github.com/lzw108/ConspEmoLLM.
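The augmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. The names `Example`, `augment`, `toy_rewrite`, `toy_score` and the `min_shift` threshold are hypothetical. The sketch assumes a sentiment score in [0, 1] where a larger value means a less negative tone, and it uses toy stand-ins in place of the LLM rewriter and the LLM-based sentiment scorer. The human assessment mentioned in the abstract is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:
    text: str
    label: str         # label carried over unchanged from the original tweet
    transformed: bool  # True if the text is a sentiment-transformed rewrite


def augment(
    examples: Iterable[Example],
    rewrite: Callable[[str], str],
    score: Callable[[str], float],
    min_shift: float = 0.2,  # hypothetical threshold, not from the paper
) -> List[Example]:
    """Keep every original and add a rewrite only if its score rose enough."""
    out: List[Example] = []
    for ex in examples:
        out.append(ex)
        candidate = rewrite(ex.text)
        # Assumption: a higher score means a less negative tone.
        if score(candidate) - score(ex.text) >= min_shift:
            out.append(Example(candidate, ex.label, True))
    return out


# Toy stand-ins for the LLM rewriter and the sentiment scorer (assumptions).
SOFTENED = {"evil": "questionable", "lies": "claims"}


def toy_rewrite(text: str) -> str:
    return " ".join(SOFTENED.get(w.lower(), w) for w in text.split())


def toy_score(text: str) -> float:
    words = text.lower().split()
    if not words:
        return 0.5
    negative = sum(w in SOFTENED for w in words)
    return 1.0 - negative / len(words)


data = [
    Example("They spread evil lies", "conspiracy", False),
    Example("The weather is nice today", "non-conspiracy", False),
]
augmented = augment(data, toy_rewrite, toy_score)
for ex in augmented:
    print(ex.transformed, ex.label, ex.text)
```

The rewrite keeps the label of its source tweet, so a detector trained on the augmented set sees the same conspiracy content in both the original and the softened tone. A rewrite whose score does not move by at least `min_shift` is discarded rather than added.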
