"You are an expert annotator": Automatic Best-Worst-Scaling Annotations for Emotion Intensity Modeling
Christopher Bagdon, Prathamesh Karmalker, Harsha Gurulingappa, Roman Klinger
TL;DR
The paper tackles the challenge of obtaining reliable continuous emotion-intensity labels for regression tasks by comparing four annotation paradigms using large language models: rating scales (RS), rating scales with tuples (RS-T), paired comparisons (PC), and best-worst scaling (BWS). It demonstrates that BWS yields the most reliable automated annotations, with direct and indirect evaluations showing annotation quality approaching that of human data when training a transformer regressor. Increasing the number of annotation tuples further improves performance, and cross-model comparisons reveal GPT-3.5-turbo as the strongest current option among tested models. The findings support using BWS-based automated annotations to scale emotion-intensity regression to new datasets, while highlighting practical considerations like cost, distribution effects, and the need for broader validation on additional tasks and open models.
Abstract
Labeling corpora constitutes a bottleneck in creating models for new tasks or domains. Large language models mitigate the issue with automatic corpus labeling methods, particularly for categorical annotations. Some NLP tasks such as emotion intensity prediction, however, require text regression, but there is no work on automating annotations for continuous label assignments. Regression is considered more challenging than classification: the fact that humans perform worse when tasked with choosing values from a rating scale has led to comparative annotation methods, including best-worst scaling. This raises the question of whether large language model-based annotation methods show similar patterns, namely that they perform worse on rating scale annotation tasks than on comparative annotation tasks. To study this, we automate emotion intensity predictions and compare direct rating scale predictions, pairwise comparisons, and best-worst scaling. We find that the latter shows the highest reliability. A transformer regressor fine-tuned on these data performs nearly on par with a model trained on the original manual annotations.
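To make the comparative setup concrete, the following minimal sketch shows one common way to turn best-worst scaling judgments into continuous scores: for each item, the fraction of times it was chosen as best minus the fraction of times it was chosen as worst. This counting procedure is an assumption for illustration; the abstract does not state which aggregation the authors use. The function name `bws_scores`, the annotation format, and the toy item labels are hypothetical.

```python
from collections import Counter


def bws_scores(annotations):
    """Aggregate best-worst judgments into scores in [-1, 1].

    `annotations` is an iterable of (items, best, worst) triples, where
    `items` is the tuple shown to the annotator (human or LLM) and
    `best` / `worst` are the items picked as most / least intense.
    This input format is a hypothetical choice for this sketch.
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in annotations:
        if b not in items or w not in items:
            raise ValueError("best and worst must be members of the tuple")
        seen.update(items)  # count how often each item was shown
        best[b] += 1
        worst[w] += 1
    # Score = share of appearances chosen as best minus share chosen as worst.
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Toy example: five texts, abbreviated as single letters, in three 4-tuples.
annotations = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "e"), "a", "c"),
    (("b", "c", "d", "e"), "b", "d"),
]
scores = bws_scores(annotations)
```

Under this assumed aggregation, items that appear in more tuples receive finer-grained scores, which fits the TL;DR's observation that increasing the number of annotation tuples improves performance.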
