Hierarchical Multi-Label Classification of Online Vaccine Concerns

Chloe Qinyu Zhu, Rickard Stureborg, Bhuwan Dhingra

TL;DR

The paper tackles the challenge of detecting evolving online vaccine concerns using a zero-shot, hierarchical multi-label approach grounded in the VaxConcerns taxonomy. It evaluates seven prompting strategies across multiple large language models, introducing format demonstrations to improve output reliability and conducting a cost-performance analysis to identify Pareto-optimal designs. The standout result is GPT-4 with multi-pass binary prompting achieving an $F1$ score of 78.65, outperforming the best crowdworker baselines, with additional insights into cheaper, near-equivalent configurations. The work provides practical guidance for public health surveillance systems aiming to monitor misinformation at scale while balancing cost and accuracy.
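To make the notion of a Pareto-optimal design concrete, the following is a minimal sketch of how one might filter (cost, F1) pairs down to a Pareto frontier. The `Config` type, the `pareto_optimal` helper, and the numbers are all hypothetical placeholders for illustration; they are not the paper's code or results.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str    # hypothetical label for a model + prompting-strategy pair
    cost: float  # total inference cost (lower is better)
    f1: float    # F1 score (higher is better)


def pareto_optimal(configs: list[Config]) -> list[Config]:
    """Return the configurations that no other configuration dominates.

    A configuration is dominated if another one is at least as cheap and at
    least as accurate, and strictly better on one of the two.
    """
    frontier = []
    for c in configs:
        dominated = any(
            o.cost <= c.cost and o.f1 >= c.f1 and (o.cost < c.cost or o.f1 > c.f1)
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.cost)


# Placeholder numbers for illustration only; these are NOT results from the paper.
candidates = [
    Config("A", cost=1.0, f1=60.0),
    Config("B", cost=2.0, f1=55.0),  # dominated by A (more expensive, less accurate)
    Config("C", cost=5.0, f1=75.0),
    Config("D", cost=9.0, f1=75.0),  # dominated by C (same F1, more expensive)
]
print([c.name for c in pareto_optimal(candidates)])
```

Any configuration left on the frontier is a defensible choice; which one to deploy depends on the budget available.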

Abstract

Vaccine concerns are an ever-evolving target and can shift quickly, as seen during the COVID-19 pandemic. Identifying longitudinal trends in vaccine concerns and misinformation might inform the healthcare space by helping public health efforts strategically allocate resources or information campaigns. We explore the task of detecting vaccine concerns in online discourse using large language models (LLMs) in a zero-shot setting without the need for expensive training datasets. Since real-time monitoring of online sources requires large-scale inference, we explore cost-accuracy trade-offs of different prompting strategies and offer concrete takeaways that may inform choices in system designs for current applications. An analysis of different prompting strategies reveals that classifying the concerns over multiple passes through the LLM, each consisting of a boolean question asking whether the text mentions a given vaccine concern, works best. Our results indicate that GPT-4 can strongly outperform crowdworker accuracy when compared to ground truth annotations provided by experts on the recently introduced VaxConcerns dataset, achieving an overall F1 score of 78.7%.
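The multi-pass binary idea described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the label names, the prompt wording, and the `ask_llm` callable are hypothetical and do not come from the paper, and `fake_llm` is an offline stand-in for a real model call.

```python
from typing import Callable

# Hypothetical concern labels; the real label set comes from the VaxConcerns taxonomy.
LABELS = ["concern_a", "concern_b", "concern_c"]


def build_prompt(text: str, label: str) -> str:
    """Build one yes/no question about a single label (wording is an assumption)."""
    return (
        f"Does the following text mention the vaccine concern '{label}'? "
        f"Answer 'yes' or 'no'.\n\nText: {text}"
    )


def classify_multi_pass_binary(
    text: str, labels: list[str], ask_llm: Callable[[str], str]
) -> dict[str, bool]:
    """Issue one model call per label and parse each answer as a boolean."""
    results = {}
    for label in labels:
        answer = ask_llm(build_prompt(text, label))
        results[label] = answer.strip().lower().startswith("yes")
    return results


def fake_llm(prompt: str) -> str:
    """Stand-in for a real API call so the sketch runs offline."""
    question = prompt.split("\n\n")[0]  # the part before the text to classify
    return "yes" if "concern_b" in question else "no"


print(classify_multi_pass_binary("some online post", LABELS, fake_llm))
```

Because every label costs a separate request, the number of model calls grows linearly with the size of the label set, which is the source of the cost trade-off the abstract mentions.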

Paper Structure (15 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Prompting strategy workflows. We experimented with 7 combinations of prompting and logic-passing style. Each prompting strategy is outlined above. Boxes represent stand-alone API calls to the language model, such as a single OpenAI request. The content of each box describes which final labels are requested in the respective prompt. Arrows show any logic that is carried out across these API requests, which only occurs when using a hierarchical-pass strategy.
  • Figure 2: Mean failure rates in single-pass classification for various models, both zero-shot and with format demonstrations. Format demonstrations vastly reduce the failure rates in all models. GPT-4* models have a near-zero failure rate due to their strong controllability in respecting instructed output formats. Failures occur due to formatting errors or missing labels, described further in Appendix A.
  • Figure 3: Inference costs for each prompting strategy with format demonstration. Despite using the same model, costs can vary massively (multi-pass binary is 9.4x more expensive than hrchl-pass multi). Costs of single-pass strategies are higher because they need format demonstrations to reduce the failure rate to a reasonable level, for example with GPT-3.5-Turbo (Figure 2). Overall, performing hierarchical passes in small groups of labels is the cheapest prompting strategy by far, while binary labeling (seeing only one label at a time) is the most expensive. Cost is given for the whole dataset of 200 examples.
  • Figure 4: Total cost versus performance by model and prompting strategy. Throughout our experiments, we show a positive relationship between cost of prediction and performance. This relationship is largely driven by cost differences between models, but the relationship between the cost of a prompting strategy and its performance is also positive. This could hint that models perform better when focusing on fewer labels per generation.
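The cross-request logic that Figure 1 attributes to hierarchical-pass strategies might look something like the sketch below. It assumes a two-level taxonomy and assumes that child labels are only queried when their parent was predicted present; the paper's exact control flow may differ. The taxonomy, the `ask_labels` callable, and `fake_ask_labels` are hypothetical stand-ins.

```python
from typing import Callable

# Hypothetical two-level taxonomy; the real hierarchy is defined by VaxConcerns.
TAXONOMY = {
    "parent_1": ["child_1a", "child_1b"],
    "parent_2": ["child_2a"],
}


def classify_hierarchical(
    text: str,
    taxonomy: dict[str, list[str]],
    ask_labels: Callable[[str, list[str]], set[str]],
) -> tuple[set[str], int]:
    """Query parent labels first, then children of predicted parents only.

    `ask_labels(text, candidates)` stands for ONE model call that returns the
    subset of `candidates` judged present. Returns (labels, number of calls).
    """
    calls = 1
    parents = ask_labels(text, list(taxonomy))
    predicted = set(parents)
    for parent, children in taxonomy.items():
        if parent in parents:  # assumption: skip subtrees whose parent is absent
            predicted |= ask_labels(text, children)
            calls += 1
    return predicted, calls


def fake_ask_labels(text: str, candidates: list[str]) -> set[str]:
    """Offline stand-in that pretends two fixed labels are present."""
    return {"parent_1", "child_1b"} & set(candidates)


labels, n_calls = classify_hierarchical("some online post", TAXONOMY, fake_ask_labels)
print(sorted(labels), n_calls)
```

Under this assumption, subtrees whose parent is absent never trigger a request, which is one plausible reason a hierarchical pass over small label groups needs fewer calls than asking about every label separately.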