Accurate Measures of Vaccination and Concerns of Vaccine Holdouts from Web Search Logs

Serina Chang, Adam Fourney, Eric Horvitz

TL;DR

This study tackles the lack of timely, granular vaccination data by using large-scale web search logs and machine learning to infer vaccine intent and the concerns of vaccine holdouts. It builds a vaccine-intent classifier from billions of Bing search logs, achieving state-level correlations with CDC data ($r=0.86$) and near real-time, ZIP-code-level estimates, aided by a three-step URL identification pipeline (regex seeds, S-PPR, and graph neural networks). It also constructs a 25,000-URL ontology of vaccine concerns, revealing that holdouts are more likely than early adopters to click on untrusted news sources and that they engage with distinct concern categories that differ by demographics, yet converge toward early-adopter patterns as they approach vaccine intent. The work provides fine-grained, bias-corrected signals suitable for policy design, such as targeting interventions at under-vaccinated communities and tailoring messaging to observed concerns. By releasing the vaccine-intent estimates and ontology, the authors offer public health practitioners and researchers a reproducible, scalable toolset to monitor vaccination dynamics in real time and to study information exposure and belief formation in the digital age.

Abstract

To design effective vaccine policies, policymakers need detailed data about who has been vaccinated, who is holding out, and why. However, existing data in the US are insufficient: reported vaccination rates are often delayed or missing, and surveys of vaccine hesitancy are limited by high-level questions and self-report biases. Here, we show how large-scale search engine logs and machine learning can be leveraged to fill these gaps and provide novel insights about vaccine intentions and behaviors. First, we develop a vaccine intent classifier that can accurately detect when a user is seeking the COVID-19 vaccine on search. Our classifier demonstrates strong agreement with CDC vaccination rates, with correlations above 0.86, and estimates vaccine intent rates to the level of ZIP codes in real time, allowing us to pinpoint more granular trends in vaccine seeking across regions, demographics, and time. To investigate vaccine hesitancy, we use our classifier to identify two groups, vaccine early adopters and vaccine holdouts. We find that holdouts, compared to early adopters matched on covariates, are 69% more likely to click on untrusted news sites. Furthermore, we organize 25,000 vaccine-related URLs into a hierarchical ontology of vaccine concerns, and we find that holdouts are far more concerned about vaccine requirements, vaccine development and approval, and vaccine myths, and even within holdouts, concerns vary significantly across demographic groups. Finally, we explore the temporal dynamics of vaccine concerns and vaccine seeking, and find that key indicators emerge when individuals convert from holding out to preparing to accept the vaccine.
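
The agreement with CDC vaccination rates reported above is a correlation between two per-region rate vectors. As a minimal sketch of that comparison, assuming a plain Pearson correlation and using made-up numbers (the function name `pearson_r` and both toy lists are illustrative and do not come from the paper):

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-state values, invented for illustration only.
    estimated_intent_rate = [0.41, 0.55, 0.62, 0.48, 0.70]
    reported_vaccination_rate = [0.45, 0.58, 0.66, 0.47, 0.74]
    print(round(pearson_r(estimated_intent_rate, reported_vaccination_rate), 3))
```

A value near 1 would mean that regions with higher estimated intent also tend to report higher vaccination rates; it says nothing about whether the two rates are on the same scale.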

Paper Structure (62 sections, 6 equations, 20 figures, 7 tables)


Figures (20)

  • Figure 1: Vaccine intent classifier. (a) Our computational approach centers on query-click graphs constructed from billions of Bing search logs. (b) Using these graphs, we introduce a three-step pipeline to identify vaccine intent URLs: generate URL candidates via Personalized PageRank; present URL candidates to annotators; and expand the final set of URLs with graph neural networks. Each step improves our coverage of users and correlation with CDC vaccination rates (Table \ref{tab:pipeline-results}). (c) Our vaccine intent estimates are highly correlated with state vaccination rates from the CDC. Here, we compare cumulative rates up to August 31, 2021 ($r=0.86$). (d) Our estimates are also highly correlated with CDC rates over time ($r=0.89$, median over states), with the CDC time series lagging by 7-15 days (IQR). Here, we visualize time series for the 4 largest states in the US, with extended results in Section \ref{sec:methods-cdc}.
  • Figure 2: Granular trends in vaccine seeking. (a) Using our classifier, we can estimate vaccine intent rates per ZCTA, approximately 10x the granularity of counties. (b) Zooming in on New York City shows that estimated vaccine intent rates vary substantially across ZCTAs, even within the same city or county. (c) We measure correlations between ZCTA vaccine intent rates and demographic variables to characterize demographic trends in vaccination.
  • Figure 3: Our ontology of vaccine concerns consists of 8 top categories and 36 subcategories.
  • Figure 4: Vaccine concerns and news consumption. In all subfigures, news/categories are colored from yellow to dark purple to represent most holdout-leaning to most early adopter-leaning. (a) The lower the trust rating from Newsguard, the likelier it is that vaccine holdouts click on the news site, relative to early adopters. (b) Holdouts' top category concerns include Vaccine Safety, Requirements, and Information, with varying proportions over time. (c) Comparing holdouts vs. early adopters' relative probabilities of clicking on each subcategory (from April to June 2021) reveals each group's distinctive concerns. (d) Near when holdouts express vaccine intent ($\pm$3 days) in July and August 2021, their concerns become much more like the concerns of early adopters, with a few important differences.
  • Figure 5: First question for Amazon Mechanical Turk task.
  • ...and 15 more figures
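
The first pipeline step in Figure 1(b), generating URL candidates with Personalized PageRank over a query-click graph, can be illustrated with a small sketch. This is a generic power-iteration version and not necessarily the variant the authors use; the function names, the restart probability `alpha`, the iteration count, and all queries and URLs in the toy graph are hypothetical.

```python
from collections import defaultdict


def personalized_pagerank(edges, seeds, alpha=0.15, iters=100):
    """Power-iteration Personalized PageRank on an undirected query-click graph.

    edges: iterable of (query, url, click_count) triples (toy stand-in for logs)
    seeds: set of node keys, e.g. ("url", "..."), that receive the restart mass
    alpha: restart probability (an assumed value, not taken from the paper)
    """
    adj = defaultdict(dict)
    for query, url, clicks in edges:
        q_node, u_node = ("query", query), ("url", url)
        adj[q_node][u_node] = adj[q_node].get(u_node, 0.0) + clicks
        adj[u_node][q_node] = adj[u_node].get(q_node, 0.0) + clicks
    nodes = list(adj)
    seeds_in_graph = [s for s in seeds if s in adj]
    if not seeds_in_graph:
        raise ValueError("no seed node appears in the graph")
    restart = {n: 0.0 for n in nodes}
    for s in seeds_in_graph:
        restart[s] = 1.0 / len(seeds_in_graph)
    score = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            total = sum(adj[n].values())
            # Spread the walker's mass to neighbors in proportion to clicks.
            for m, w in adj[n].items():
                nxt[m] += (1.0 - alpha) * score[n] * w / total
        score = nxt
    return score


def rank_candidate_urls(edges, seeds, **kwargs):
    """Return non-seed URLs sorted by descending PPR score."""
    scores = personalized_pagerank(edges, seeds, **kwargs)
    urls = [(key[1], s) for key, s in scores.items()
            if key[0] == "url" and key not in seeds]
    return sorted(urls, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy graph: every query, URL, and click count is invented.
TOY_EDGES = [
    ("pharmacy a covid vaccine", "pharmacy-a.example/vaccine", 30),
    ("covid vaccine near me", "pharmacy-a.example/vaccine", 20),
    ("covid vaccine near me", "pharmacy-b.example/schedule-vaccine", 25),
    ("covid vaccine near me", "health.example/symptoms", 1),
    ("covid symptoms", "health.example/symptoms", 40),
]
TOY_SEEDS = {("url", "pharmacy-a.example/vaccine")}

if __name__ == "__main__":
    for url, score in rank_candidate_urls(TOY_EDGES, TOY_SEEDS):
        print(f"{score:.4f}  {url}")
```

In this toy graph the second pharmacy URL ranks above the symptoms page because it shares a heavily clicked query with the seed. In the pipeline described in the caption, such ranked candidates would then go to annotators rather than being accepted automatically.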