Compressing Search with Language Models

Thomas Mulc, Jennifer L. Steele

TL;DR

The paper addresses the problem of using large volumes of search data for real-world forecasting. It introduces SLaM Compression, which uses pretrained language models to convert large vocabularies of search terms into compact, semantics-preserving embeddings, and CoSMo, a Constrained Search Model that predicts outcomes from these embeddings with region-aware structure. SLaM produces a fixed-length embedding per period, $\gamma_t = \sum_{s \in S} v_{s,t} \cdot \mathrm{LM}(s)$, together with a normalized version $\gamma^*_t$, enabling efficient aggregation and analysis without manual term filtering; CoSMo combines search volume, a probabilistic mapping $P(\gamma^*, \theta, r)$, and region multipliers to yield calibrated predictions. Empirical results on U.S. automobile sales and influenza-like illness demonstrate substantial gains over traditional Google Trends and linear baselines, with notable improvements in nowcasting accuracy and competitive performance relative to autoregressive methods, while maintaining interpretability at the term level. The work also highlights privacy-preserving aspects and the potential for zero-shot geo-transfer, supported by multilingual embeddings that enhance performance across languages. Overall, SLaM and CoSMo offer a scalable, interpretable, and privacy-conscious framework for turning text-based search signals into predictions.
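
The aggregation formula above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_lm` is a hypothetical stand-in for a real pretrained language model, the example terms and volumes are invented, and the normalization (dividing by total search volume) is one plausible choice that this summary does not specify.

```python
import numpy as np


def toy_lm(term: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for LM(s): a deterministic unit-length pseudo-embedding."""
    # Derive a reproducible seed from the characters of the term.
    seed = sum(ord(c) * (i + 1) for i, c in enumerate(term)) % (2**32)
    rng = np.random.default_rng(seed)
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)


def slam_embedding(volumes: dict[str, float], dim: int = 8):
    """Compute gamma_t = sum_s v_{s,t} * LM(s) and an assumed normalized version."""
    gamma = np.zeros(dim)
    for term, volume in volumes.items():
        gamma += volume * toy_lm(term, dim)
    total = sum(volumes.values())
    # Assumption: normalize by total search volume in the period.
    gamma_star = gamma / total if total > 0 else gamma
    return gamma, gamma_star


# Invented search volumes for one time period.
week = {"new car deals": 120.0, "flu symptoms": 30.0, "suv lease": 50.0}
gamma, gamma_star = slam_embedding(week)
print(gamma.shape, gamma_star.shape)
```

Whatever the number of distinct search terms, the result has length `dim`, which is the property that makes the representation compact. Under the assumed normalization, scaling all volumes by a constant leaves `gamma_star` unchanged, so it reflects the mix of terms rather than overall search volume.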

Abstract

Millions of people turn to Google Search each day for information on things as diverse as new cars or flu symptoms. The terms that they enter contain valuable information on their daily intent and activities, but the information in these search terms has been difficult to fully leverage. User-defined categorical filters have been the most common way to shrink the dimensionality of search data to a tractable size for analysis and modeling. In this paper we present a new approach to reducing the dimensionality of search data while retaining much of the information in the individual terms without user-defined rules. Our contributions are two-fold: 1) we introduce SLaM Compression, a way to quantify search terms using pre-trained language models and create a representation of search data that has low dimensionality, is memory efficient, and effectively acts as a summary of search, and 2) we present CoSMo, a Constrained Search Model for estimating real world events using only search data. We demonstrate the efficacy of our contributions by estimating with high accuracy U.S. automobile sales and U.S. flu rates using only Google Search data.

Paper Structure

This paper contains 27 sections, 28 equations, 7 figures, and 12 tables.

Figures (7)

  • Figure 1: SLaM takes as input all searches during a given time period and compresses them into a fixed-length vector that is effectively a summary of all search terms. (Left) Each search term is passed through a language model that produces a fixed-length vector of size $D$. Colors represent unique search terms, while shadings represent different embedding dimensions. (Right) All the $D$-length vectors are passed to the aggregation step, where they are reduced to a single vector, the search embedding, of size $\mathcal{O}(D)$, which is later used as a feature for modeling.
  • Figure 2: Structure of the CoSMo model used in all models.
  • Figure 3: National U.S. flu modeling plot for the training and test periods. CoSMo predicted values are the average of 40 training runs with random seeds, with the shaded areas representing the 95% confidence interval.
  • Figure 4: U.S. automotive sales, actuals vs. predictions. A 4-week rolling average of the model predictions and targets was generated to smooth out spikes typically caused by end-of-month reporting variability. On the test period the model has an $R^2$ of 0.9065 and a MAPE of 3.03. The vertical line indicates the beginning of the test period.
  • Figure 5: Visualization of auto search terms and estimated impact from model predictions. Each dot represents a distinct search term, and terms have been clustered based on embedding vectors, and hand-labeled for exposition.
  • ...and 2 more figures