Hierarchical Multi-Label Classification of Online Vaccine Concerns
Chloe Qinyu Zhu, Rickard Stureborg, Bhuwan Dhingra
TL;DR
The paper tackles the challenge of detecting evolving online vaccine concerns using a zero-shot, hierarchical multi-label approach grounded in the VaxConcerns taxonomy. It evaluates seven prompting strategies across multiple large language models, introducing format demonstrations to improve output reliability and conducting a cost-performance analysis to identify Pareto-optimal designs. The standout result is GPT-4 with multi-pass binary prompting achieving an F1 score of 78.65, outperforming the best crowdworker baselines, with additional insights into cheaper, near-equivalent configurations. The work provides practical guidance for public health surveillance systems aiming to monitor misinformation at scale while balancing cost and accuracy.
Abstract
Vaccine concerns are an ever-evolving target and can shift quickly, as seen during the COVID-19 pandemic. Identifying longitudinal trends in vaccine concerns and misinformation might inform the healthcare space by helping public health efforts strategically allocate resources or information campaigns. We explore the task of detecting vaccine concerns in online discourse using large language models (LLMs) in a zero-shot setting, without the need for expensive training datasets. Since real-time monitoring of online sources requires large-scale inference, we explore the cost-accuracy trade-offs of different prompting strategies and offer concrete takeaways that may inform system design choices for current applications. An analysis of different prompting strategies reveals that classifying the concerns over multiple passes through the LLM, each consisting of a Boolean question asking whether the text mentions a given vaccine concern, works best. Our results indicate that GPT-4 can strongly outperform crowdworker accuracy when compared against ground-truth annotations provided by experts on the recently introduced VaxConcerns dataset, achieving an overall F1 score of 78.7%.
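To make the multi-pass binary strategy concrete, the following is a minimal Python sketch of the idea: one yes/no question is sent to the model per candidate concern, and the answers are collected into a multi-label prediction. It is an illustration under stated assumptions, not the authors' implementation. The label names, the prompt wording, and the helpers `build_prompt`, `parse_answer`, `classify_multi_pass`, and `fake_llm` are all hypothetical, and `fake_llm` is a keyword-matching stand-in so the example runs without any model access.

```python
from typing import Callable, Dict, List

# Hypothetical labels for illustration only; not the actual VaxConcerns taxonomy.
EXAMPLE_LABELS: List[str] = ["health risks", "lack of benefits", "untrustworthy actors"]


def build_prompt(text: str, label: str) -> str:
    """Build one yes/no question for a single label (the wording is an assumption)."""
    return (
        f"Passage: {text}\n"
        f"Does the passage express the vaccine concern '{label}'? "
        "Answer with exactly 'yes' or 'no'."
    )


def parse_answer(raw: str) -> bool:
    """Map a free-text reply to a boolean; anything not starting with 'yes' is False."""
    return raw.strip().lower().startswith("yes")


def classify_multi_pass(
    text: str, labels: List[str], ask_llm: Callable[[str], str]
) -> Dict[str, bool]:
    """Make one model call per label and collect the binary decisions."""
    return {label: parse_answer(ask_llm(build_prompt(text, label))) for label in labels}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: simple keyword match so the sketch runs offline."""
    asks_health = "'health risks'" in prompt
    return "Yes." if asks_health and "side effects" in prompt else "No."


if __name__ == "__main__":
    result = classify_multi_pass("I worry about side effects.", EXAMPLE_LABELS, fake_llm)
    print(result)
```

In this sketch the number of model calls grows linearly with the number of labels, which illustrates why the cost-accuracy trade-off matters when such a strategy is run at scale. In practice, `ask_llm` would wrap whatever model client is in use.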
