Accurate Measures of Vaccination and Concerns of Vaccine Holdouts from Web Search Logs
Serina Chang, Adam Fourney, Eric Horvitz
TL;DR
This study addresses the lack of timely, granular vaccination data by using large-scale web search logs and machine learning to infer vaccine intent and the concerns of vaccine holdouts. It builds a vaccine intent classifier from billions of Bing search signals; the classifier's estimates correlate strongly with CDC vaccination rates at the state level ($r=0.86$) and are available in near real time down to the ZIP code level. The classifier relies on a three-step pipeline for identifying vaccine intent URLs: regex-matched seed URLs, seed expansion with personalized PageRank from the seed set (S-PPR), and graph neural networks. The study also organizes 25,000 vaccine-related URLs into an ontology of vaccine concerns. It finds that holdouts are 69% more likely than matched early adopters to click on untrusted news sites, that their concerns fall into distinct categories that vary by demographic group, and that their concerns shift toward early-adopter patterns as they approach vaccine intent. These fine-grained, bias-corrected signals can inform policy design, for example by targeting interventions at under-vaccinated communities and tailoring messaging to the concerns observed there. By releasing the vaccine intent estimates and the ontology, the authors give public health practitioners and researchers a reproducible, scalable toolset for monitoring vaccination dynamics in real time and for studying information exposure and belief formation online.
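To make the seed-expansion step of the pipeline more concrete, here is a minimal sketch of personalized PageRank on a toy query-URL click graph, reading S-PPR as PageRank that restarts at the seed URLs. Everything in it is an illustrative assumption rather than the authors' implementation: the function name `personalized_pagerank`, the damping value `alpha=0.85`, the iteration count, and the example queries and `*.example` URLs are all hypothetical.

```python
from collections import defaultdict


def personalized_pagerank(edges, seeds, alpha=0.85, iters=50):
    """Power-iteration personalized PageRank on an undirected graph.

    edges: iterable of (node_a, node_b) pairs, e.g. (query, clicked_url).
    seeds: nodes that receive all of the restart (teleport) mass.
    alpha: probability of following an edge instead of restarting.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    nodes = list(neighbors)

    # Keep only seeds that actually occur in the graph.
    seed_set = {s for s in seeds if s in neighbors}
    if not seed_set:
        raise ValueError("no seed appears in the graph")

    # Restart distribution: uniform over the seeds, zero elsewhere.
    restart = {n: 0.0 for n in nodes}
    for s in seed_set:
        restart[s] = 1.0 / len(seed_set)

    score = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            # Each node splits its current score evenly among its neighbors.
            share = alpha * score[n] / len(neighbors[n])
            for m in neighbors[n]:
                nxt[m] += share
        score = nxt
    return score


# Hypothetical toy click graph: (query, clicked URL) pairs.
edges = [
    ("cvs covid vaccine", "cvs.example/vaccine"),
    ("cvs covid vaccine", "cvs.example/schedule"),
    ("book vaccine appointment", "cvs.example/schedule"),
    ("book vaccine appointment", "health.example/find-vaccine"),
    ("weather today", "weather.example/forecast"),
]
seeds = ["cvs.example/vaccine"]  # e.g. a URL matched by a seed regex

scores = personalized_pagerank(edges, seeds)
candidates = sorted(
    (n for n in scores if "/" in n and n not in seeds),
    key=scores.get,
    reverse=True,
)
```

In this sketch, URLs that share queries with the seed rank ahead of unrelated ones, so `candidates` lists possible vaccine intent URLs in order of proximity to the seed. In a pipeline like the one described, such a ranking would only shortlist URLs for further labeling and modeling; it would not decide vaccine intent on its own.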
Abstract
To design effective vaccine policies, policymakers need detailed data about who has been vaccinated, who is holding out, and why. However, existing data in the US are insufficient: reported vaccination rates are often delayed or missing, and surveys of vaccine hesitancy are limited by high-level questions and self-report biases. Here, we show how large-scale search engine logs and machine learning can be leveraged to fill these gaps and provide novel insights about vaccine intentions and behaviors. First, we develop a vaccine intent classifier that can accurately detect when a user is seeking the COVID-19 vaccine on search. Our classifier demonstrates strong agreement with CDC vaccination rates, with correlations above 0.86, and estimates vaccine intent rates to the level of ZIP codes in real time, allowing us to pinpoint more granular trends in vaccine seeking across regions, demographics, and time. To investigate vaccine hesitancy, we use our classifier to identify two groups, vaccine early adopters and vaccine holdouts. We find that holdouts, compared to early adopters matched on covariates, are 69% more likely to click on untrusted news sites. Furthermore, we organize 25,000 vaccine-related URLs into a hierarchical ontology of vaccine concerns, and we find that holdouts are far more concerned about vaccine requirements, vaccine development and approval, and vaccine myths, and even within holdouts, concerns vary significantly across demographic groups. Finally, we explore the temporal dynamics of vaccine concerns and vaccine seeking, and find that key indicators emerge when individuals convert from holding out to preparing to accept the vaccine.
