SQBC: Active Learning using LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions

Stefan Sylvius Wagner, Maike Behrendt, Marc Ziegele, Stefan Harmeling

TL;DR

The paper tackles data scarcity in stance detection for online political discussions by introducing two LLM-driven strategies: per-question synthetic data augmentation to improve fine-tuning, and SQBC, a synthetic-data-driven Query by Committee method that uses embedding similarity to identify the most informative unlabelled samples for labeling. SQBC treats the synthetic data as an oracle and selects unlabelled samples according to their proximity to synthetic exemplars, so that labeling them yields the most information for the model. Experiments on the X-Stance dataset show that synthetic augmentation improves performance over baselines, and that SQBC substantially reduces labeling effort while often matching or exceeding full-data performance, especially when combined with synthetic data. Together, these findings highlight the practical value of synthetic data for grounding per-question stance and enabling efficient, targeted annotation in politically contextual NLP tasks.
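The selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: this summary does not give the exact scoring rule or the role of $\kappa$, so the sketch assumes that each unlabelled sample is scored by how strongly its k nearest synthetic neighbours (by cosine similarity) disagree on the stance label. The function name `sqbc_select` and the parameters `k` and `n_select` are hypothetical, and the embeddings are assumed to be precomputed by any sentence encoder.

```python
import numpy as np


def sqbc_select(unlabelled_emb, synthetic_emb, synthetic_labels, k=5, n_select=2):
    """Rank unlabelled samples by disagreement among nearby synthetic samples.

    Assumption: binary stance labels (0 or 1) on the synthetic data, which
    act as the committee voting on each unlabelled sample.
    """
    # Normalise rows so that the dot product equals cosine similarity.
    u = unlabelled_emb / np.linalg.norm(unlabelled_emb, axis=1, keepdims=True)
    s = synthetic_emb / np.linalg.norm(synthetic_emb, axis=1, keepdims=True)
    sim = u @ s.T  # shape: (n_unlabelled, n_synthetic)

    # Indices of the k most similar synthetic samples per unlabelled sample.
    nn_idx = np.argsort(-sim, axis=1)[:, :k]
    votes = synthetic_labels[nn_idx]  # committee votes, shape (n_unlabelled, k)
    frac_pos = votes.mean(axis=1)

    # Disagreement is 1 when votes split evenly and 0 when they are unanimous.
    disagreement = 1.0 - np.abs(2.0 * frac_pos - 1.0)

    # Most ambiguous samples first; these would be sent for manual labelling.
    order = np.argsort(-disagreement, kind="stable")
    return order[:n_select], disagreement
```

Under this assumption, a sample sitting between "favour" and "against" synthetic exemplars receives a high score and is queued for annotation, while samples surrounded by unanimous neighbours are skipped.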

Abstract

Stance detection is an important task for many applications that analyse or support online political discussions. Common approaches include fine-tuning transformer-based models. However, these models require a large amount of labelled data, which might not be available. In this work, we present two different ways to leverage LLM-generated synthetic data to train and improve stance detection agents for online political discussions: first, we show that augmenting a small fine-tuning dataset with synthetic data can improve the performance of the stance detection model. Second, we propose a new active learning method called SQBC based on the "Query-by-Committee" approach. The key idea is to use LLM-generated synthetic data as an oracle to identify the most informative unlabelled samples, which are then selected for manual labelling. Comprehensive experiments show that both ideas can improve the stance detection performance. Curiously, we observed that fine-tuning on actively selected samples can exceed the performance of using the full dataset.

Paper Structure (22 sections, 6 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: SQBC (Synthetic Data-driven Query By Committee) can reduce labelling effort while improving stance detection performance. Larger values of $\kappa$ correspond to fewer samples chosen for manual labelling (see Eq. \ref{eq:kappa}), hence reducing labelling effort. All bars above the dashed line represent performance superior to the baseline. Using the samples chosen for manual labelling together with the remaining (not chosen) samples for fine-tuning delivers the best overall performance. In some cases, labelling effort can be reduced substantially (only 20% of the data is needed) while still delivering better performance than the model trained with all true labels. If a dataset is already labelled, then using synthetic data to augment the dataset also improves the performance of the stance detection model.
  • Figure 2: Extended results of Figure 1. We present the results for each of the 5 questions we chose from the X-Stance test set. The numbers on the x-axis correspond to the number of samples chosen for manual labelling. These correspond to the values of $\kappa$ in Figure 1.