Limited Effectiveness of LLM-based Data Augmentation for COVID-19 Misinformation Stance Detection
Eun Cheol Choi, Ashwin Balasubramanian, Jinhu Qi, Emilio Ferrara
TL;DR
The paper investigates data augmentation for COVID-19 misinformation stance detection (SD) using controllable misinformation generation (CMG) with large language models (LLMs). It adopts a two-stage finetuning setup, with general natural language inference (NLI) pretraining followed by COVID-19 SD finetuning, and compares CMG against traditional augmentation methods across varying data sizes. The results show that CMG yields only marginal or inconsistent gains, largely because LLM safeguards cause refusals and task flipping during generation. The authors release code and datasets to promote further research and highlight the need for more flexible or domain-specific augmentation strategies. The findings underscore that current CMG approaches may not reliably outperform simpler augmentation in misinformation-sensitive NLP tasks.
Abstract
Misinformation surrounding emerging outbreaks poses a serious societal threat, making robust countermeasures essential. One promising approach is stance detection (SD), which identifies whether social media posts support or oppose misleading claims. In this work, we finetune classifiers on COVID-19 misinformation SD datasets consisting of claims and corresponding tweets. Specifically, we test controllable misinformation generation (CMG) using large language models (LLMs) as a method for data augmentation. While CMG demonstrates the potential for expanding training datasets, our experiments reveal that performance gains over traditional augmentation methods are often minimal and inconsistent, primarily due to built-in safeguards within LLMs. We release our code and datasets to facilitate further research on misinformation detection and generation.
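To make the refusal problem concrete, the following is a minimal sketch of how LLM-generated augmentation samples might be screened for refusals before they are added to an SD training set. It is not the authors' released code: the refusal patterns, the function names `is_refusal` and `filter_generations`, and the sample fields `claim`, `tweet`, and `stance` are all hypothetical choices made for illustration. Task flipping (a generation that argues the opposite stance from the one requested) is not detectable with keyword patterns like these and would need a separate check.

```python
import re

# Hypothetical refusal markers chosen for illustration; not taken from the paper.
# Only straight apostrophes are handled here.
REFUSAL_PATTERNS = [
    r"\bI(?: am|'m) sorry\b",
    r"\bI can(?:not|'t) (?:help|assist|comply|generate|create|write)\b",
    r"\bas an AI\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def is_refusal(text: str) -> bool:
    """Return True if the text matches any of the assumed refusal patterns."""
    return bool(_REFUSAL_RE.search(text))


def filter_generations(samples):
    """Split generated samples into (kept, dropped) based on refusal matching.

    Each sample is assumed to be a dict with a "tweet" field holding the
    generated text; other fields are passed through unchanged.
    """
    kept, dropped = [], []
    for sample in samples:
        if is_refusal(sample["tweet"]):
            dropped.append(sample)
        else:
            kept.append(sample)
    return kept, dropped


if __name__ == "__main__":
    # Invented examples, not drawn from any dataset.
    generated = [
        {"claim": "c1", "tweet": "I'm sorry, but I can't help with that.", "stance": "support"},
        {"claim": "c1", "tweet": "This rumor has been debunked many times.", "stance": "oppose"},
    ]
    kept, dropped = filter_generations(generated)
    print(f"kept {len(kept)} of {len(generated)} generated samples")
```

A keyword filter of this kind is cheap but coarse: it can miss soft refusals and can wrongly drop genuine tweets that happen to contain an apology, so the dropped set is worth inspecting manually.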
