Mining Reasons For And Against Vaccination From Unstructured Data Using Nichesourcing and AI Data Augmentation

Damián Ariel Furman; Juan Junqueras; Z. Burçe Gümüslü; Edgar Altszyler; Joaquin Navajas; Ophelia Deroy; Justin Sulik

TL;DR

This work addresses mining reasons for and against vaccination from unstructured text by constructing RFAV, a bilingual (English/Spanish) dataset annotated via nichesourcing and augmented with GPT-4 and GPT-3.5-Turbo. It defines a multi-task labeling scheme for Reasons, Stances, and Scientific Authorities, and evaluates several transformer architectures, including LongFormer, RoBERTa, and XLM-RoBERTa, highlighting the impact of data augmentation on performance. The study reports moderate inter-annotator agreement, demonstrates strong performance for reason detection with LongFormer, and finds that augmenting with GPT-generated data can bias label distributions and, in some cases, degrade predictive accuracy. The authors release the dataset, trained models, and annotation manual to advance research on vaccine discourse and misinformation countermeasures, while emphasizing limitations around class imbalance and ethical considerations in automated analysis.

Abstract

We present Reasons For and Against Vaccination (RFAV), a dataset for predicting reasons for and against vaccination, and the scientific authorities used to justify them, annotated through nichesourcing and augmented using GPT4 and GPT3.5-Turbo. We show that these reasons can be mined from unstructured text under different task definitions, despite the high level of subjectivity involved, and we explore the impact of artificially augmented data generated through in-context learning with GPT4 and GPT3.5-Turbo. We publish the dataset and the trained models, along with the annotation manual used to train annotators and define the task.

Paper Structure (25 sections, 21 figures, 14 tables)

Introduction
Previous Work
Corpus creation
Defining the task
Data annotation
Agreement
Data Statistics
Data augmentation using GPT4 and GPT3.5-Turbo
Data Statistics
Experiments
RoBERTa
LongFormer
XLM Roberta
BETO
SpanBERTa
...and 10 more sections

Figures (21)

Figure 1: Distribution of labeled words per annotation class in the English and Spanish expert-annotated datasets
Figure 2: Distribution of labeled words per annotation class by GPT4 and GPT3.5-Turbo on the English and Spanish datasets
Figure 3: Template used for generating prompts for annotation using GPT4 and GPT3.5. The final version of the prompt included three non-annotated examples paired with their corresponding annotations
Figure 4: Example 105 from test dataset labeled through nichesourcing with Reasons, Stances and Scientific Authorities
Figure 5: Example 105 from test dataset labeled by our finetuned Longformer only with Reasons
...and 16 more figures
