Mining Reasons For And Against Vaccination From Unstructured Data Using Nichesourcing and AI Data Augmentation
Damián Ariel Furman, Juan Junqueras, Z. Burçe Gümüslü, Edgar Altszyler, Joaquin Navajas, Ophelia Deroy, Justin Sulik
TL;DR
This work addresses mining reasons for and against vaccination from unstructured text by constructing RFAV, a bilingual (English/Spanish) dataset annotated via nichesourcing and augmented with GPT-4 and GPT-3.5-Turbo. It defines a multi-task labeling scheme for Reasons, Stances, and Scientific Authorities, and evaluates several transformer architectures, including Longformer, RoBERTa, and XLM-RoBERTa, highlighting the impact of data augmentation on performance. The study reports moderate inter-annotator agreement, demonstrates strong performance for reason detection with Longformer, and reveals that augmenting with GPT-generated data can bias label distributions and, in some cases, degrade predictive accuracy. The authors release the dataset, trained models, and annotation manual to advance research on vaccine discourse and misinformation countermeasures, while emphasizing limitations around class imbalance and ethical considerations in automated analysis.
Abstract
We present Reasons For and Against Vaccination (RFAV), a dataset for predicting reasons for and against vaccination, and the scientific authorities used to justify them, annotated through nichesourcing and augmented using GPT-4 and GPT-3.5-Turbo. We show that these reasons can be mined from unstructured text under different task definitions, despite the high level of subjectivity involved, and we explore the impact of artificially augmented data generated through in-context learning with GPT-4 and GPT-3.5-Turbo. We publish the dataset and the trained models, along with the annotation manual used to train annotators and define the task.
