Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing
Mao Li, Frederick Conrad
TL;DR
This paper evaluates stance annotation in social media by benchmarking eight LLMs against crowd-sourced judgments on citizenship-related tweets from the 2020 Census. It introduces a System 1/System 2 explicitness framework and uses variability in human judgments as a proxy for text explicitness, analyzing agreement via logistic regression and bootstrapped F1 across corpora. The findings show that LLMs perform best on explicit, System 1-style posts and struggle with posts that require implicit, System 2-style inference; few-shot prompting narrows this gap but does not eliminate it. The work advocates a hybrid approach that combines human expertise with scalable LLM predictions, and it offers practical guidance on prompting and corpus-aware annotation to make stance detection more robust and scalable.
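As a rough illustration of what a bootstrapped F1 comparison between LLM and crowd-sourced labels can look like, the sketch below resamples posts with replacement and reports a percentile interval. It is a minimal example under stated assumptions, not the authors' procedure: the macro-averaging choice, the 95% percentile interval, the function names `macro_f1` and `bootstrap_f1`, and the toy labels are all hypothetical.

```python
import random


def macro_f1(gold, pred):
    """Macro-averaged F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_f1(gold, pred, n_resamples=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap.

    Posts are resampled with replacement; gold and pred stay paired.
    """
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([gold[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return macro_f1(gold, pred), lower, upper


# Toy data (hypothetical): majority crowd label vs. one LLM's label per post.
crowd = ["favor", "against", "neutral", "favor", "against", "neutral"]
llm = ["favor", "against", "favor", "favor", "neutral", "neutral"]
point, lower, upper = bootstrap_f1(crowd, llm)
print(f"macro-F1 = {point:.3f}, 95% interval = [{lower:.3f}, {upper:.3f}]")
```

Resampling whole posts keeps each human label paired with its LLM label, so the interval reflects uncertainty about which posts happened to be in the corpus rather than noise in either annotator alone.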
Abstract
In the rapidly evolving landscape of Natural Language Processing (NLP), the use of Large Language Models (LLMs) for automated text annotation in social media posts has garnered significant interest. Despite the impressive innovations in developing LLMs like ChatGPT, their efficacy and accuracy as annotation tools are not well understood. In this paper, we analyze the performance of eight open-source and proprietary LLMs for annotating the stance expressed in social media posts, benchmarking their performance against the judgments of crowd-sourced human annotators. Additionally, we investigate the conditions under which LLMs are likely to disagree with human judgment. A significant finding of our study is that the explicitness of the text expressing a stance plays a critical role in how faithfully LLMs' stance judgments match those of humans. We argue that LLMs perform well when human annotators do, and that LLM failures often coincide with cases in which human annotators struggle to reach agreement. We conclude with recommendations for a comprehensive approach that combines the precision of human expertise with the scalability of LLM predictions. This study highlights the importance of improving the accuracy and comprehensiveness of automated stance detection, aiming to advance these technologies for more efficient and unbiased analysis of social media.
