Tailoring Vaccine Messaging with Common-Ground Opinions

Rickard Stureborg; Sanxing Chen; Ruoyu Xie; Aayushi Patel; Christopher Li; Chloe Qinyu Zhu; Tingnan Hu; Jun Yang; Bhuwan Dhingra

TL;DR

The paper addresses the challenge of tailoring vaccine messaging to common-ground opinions (CGOs) to counter misinformation and hesitancy. It proposes Tailor-CGO, a dataset and evaluation framework featuring 22,400 CGO-tailored responses from 6 LLMs, with inputs consisting of 1,166 concerns and 1,167 opinions sourced from the VaxConcerns taxonomy and OpinionQA. The authors develop automatic evaluation metrics by using GPT-4-Turbo for zero-shot scoring and distilling these signals into BERT and Llama-2 models. They find that BERT fine-tuning yields the strongest automatic judgments, while GPT-4-Turbo generally delivers the best-tailored content among the LLMs. Prompting strategies and CGO topic selection are shown to significantly influence performance. The work provides actionable recommendations for selecting CGOs and optimizing prompts (favoring Health Expert roles and non-CoT prompts), and it highlights ethical considerations and safety issues in deploying CGO-tailored vaccine messaging at scale, supported by expert validation and a discussion of limitations.

Abstract

One way to personalize chatbot interactions is by establishing common ground with the intended reader. A domain where establishing mutual understanding could be particularly impactful is vaccine concerns and misinformation. Vaccine interventions are forms of messaging which aim to answer concerns expressed about vaccination. Tailoring responses in this domain is difficult, since opinions often have seemingly little ideological overlap. We define the task of tailoring vaccine interventions to a Common-Ground Opinion (CGO). Tailoring responses to a CGO involves meaningfully improving the answer by relating it to an opinion or belief the reader holds. In this paper we introduce TAILOR-CGO, a dataset for evaluating how well responses are tailored to provided CGOs. We benchmark several major LLMs on this task, finding that GPT-4-Turbo performs significantly better than the others. We also build automatic evaluation metrics, including an efficient and accurate BERT model that outperforms finetuned LLMs, investigate how to successfully tailor vaccine messaging to CGOs, and provide actionable recommendations from this investigation. Code and model weights: https://github.com/rickardstureborg/tailor-cgo Dataset: https://huggingface.co/datasets/DukeNLP/tailor-cgo

Paper Structure (42 sections, 15 figures, 3 tables)

This paper contains 42 sections, 15 figures, 3 tables.

Introduction
Related Work
Tailor-CGO Dataset Creation
Task Definition
Concerns and Opinions Statements
Response Generation
Models
Prompting
Human Annotation
Annotation Scheme
Annotator Selection
Inter-annotator Agreement
Automatic Evaluation
Zero-shot Prompting
Fine-tuning
...and 27 more sections

Figures (15)

Figure 1: Example of a Tailored Response to answer a Vaccine Concern while Tailoring to a Common-Ground Opinion. The response above is a shortened version of a model response for the task of tailoring to CGOs. The response is able to relate two seemingly unrelated topics: side effects from vaccination and strong support for the military. It is strengthened by language and analogies that may appeal to the user without becoming manipulative. This work creates an evaluation framework and benchmarks different LLMs on their ability to generate such tailored responses.
Figure 2: Tailor-CGO dataset partition sizes. Colors indicate which train/dev/test split each partition is included in. Green = train, Yellow = dev, Blue = test. Relative preferences are collected by asking which of two responses is better tailored, while absolute scoring asks for a 1-5 score for a single response. Both Dev and Test sets (Yellow and Blue) contain 3 independently collected annotations per input response, represented by 3 stacked boxes. The training set (Green) contains just one annotation per response to maximize diversity.
Figure 3: Heatmap of mean scores by LLM evaluation for responses answering a concern (horizontal axis) while tailoring to a CGO (vertical axis). Brighter colors indicate higher scores, while white squares are nulls that were not sampled during annotation. Religion, although an opinion topic that scores poorly in our testing, seems to provide useful opinions for tailoring when focusing on the Direct transmission concern (see the appendix for an example output).
Figure 4: Comparison of mean response quality for each CGO, aggregated by topic. Notice that potentially controversial and problematic topics such as discrimination, race, or religion are bad targets for tailoring. The implication of this result is that using divisive topics to establish common ground may be less useful, while using less polarized topics (for example, self-perception) can result in stronger overall scores.
Figure 5: Comparison of Mean Response Quality by each Model in the LLM-annotated train set partition. All differences in the figure are statistically significant. Confidence intervals are computed through bootstrap sampling. Each model is evaluated on approximately 4,000 generated responses to randomly sampled concern and opinion statements. We see GPT-4-Turbo produces the best-tailored responses on average, just ahead of GPT-4. Open-source models still lag far behind, despite using the largest possible model sizes on our hardware.
...and 10 more figures
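
The Figure 5 caption notes that confidence intervals are computed through bootstrap sampling. As a rough illustration of what that can look like, the sketch below computes a percentile bootstrap interval for a mean score. The function name `bootstrap_mean_ci`, the resample count, the 95% level, and the example scores are all assumptions made for illustration; the page does not say which bootstrap variant or settings the authors used.

```python
import random
import statistics


def bootstrap_mean_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    Hypothetical helper: the percentile method and default settings are
    illustrative assumptions, not the paper's documented procedure.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    resampled_means = []
    for _ in range(n_resamples):
        # Draw n scores with replacement and record the resample mean.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    # Read the interval bounds off the sorted resample means.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return statistics.fmean(scores), resampled_means[lo_idx], resampled_means[hi_idx]


# Made-up 1-5 tailoring scores for a single model (not data from the paper).
scores = [3, 4, 5, 2, 4, 4, 3, 5, 1, 4]
mean, low, high = bootstrap_mean_ci(scores)
print(f"mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With roughly 4,000 responses per model rather than ten, the same procedure would give much narrower intervals, which is consistent with the caption's claim that all pairwise differences are significant.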
