Community search signatures as foundation features for human-centered geospatial modeling

Mimi Sun, Chaitanya Kamath, Mohit Agarwal, Arbaaz Muslim, Hector Yee, David Schottlander, Shailesh Bavadekar, Niv Efron, Shravya Shetty, Gautam Prasad

TL;DR

This work proposes a novel approach for generating an aggregated and anonymized representation of search interest as community-level foundation features for geospatial modeling, and demonstrates that the resulting models outperform both spatial interpolation and state-of-the-art methods that use satellite imagery features.

Abstract

Aggregated relative search frequencies offer a unique composite signal reflecting people's habits, concerns, interests, intents, and general information needs, a signal that is not found in other readily available datasets. Temporal search trends have been successfully used in time series modeling across a variety of domains such as infectious diseases, unemployment rates, and retail sales. However, most existing applications require curating specialized datasets of individual keywords, queries, or query clusters, and the search data need to be temporally aligned with the outcome variable of interest. We propose a novel approach for generating an aggregated and anonymized representation of search interest as foundation features at the community level for geospatial modeling. We benchmark these features using spatial datasets across multiple domains. In zip codes with a population greater than 3000, which together cover over 95% of the contiguous US population, our models for predicting missing values in a 20% holdout set of counties achieve an average $R^2$ score of 0.74 across 21 health variables and 0.80 across 6 demographic and environmental variables. Our results demonstrate that these search features can be used for spatial predictions without strict temporal alignment, and that the resulting models outperform both spatial interpolation and state-of-the-art methods that use satellite imagery features.
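
The evaluation setup described in the abstract (holding out whole counties and scoring predictions with $R^2$) can be illustrated with a minimal sketch. This is not the authors' code: the function names `split_by_county` and `r2_score`, the `zip_to_county` mapping, and the random seed are hypothetical, and the abstract does not say how holdout counties were actually selected, so random selection is an assumption.

```python
import random


def split_by_county(zip_to_county, holdout_fraction=0.2, seed=0):
    """Assign whole counties to the holdout set (assumed: chosen at random),
    so every zip code in a held-out county is excluded from training."""
    counties = sorted(set(zip_to_county.values()))
    rng = random.Random(seed)
    rng.shuffle(counties)
    n_holdout = max(1, round(holdout_fraction * len(counties)))
    holdout_counties = set(counties[:n_holdout])
    train = [z for z, c in zip_to_county.items() if c not in holdout_counties]
    holdout = [z for z, c in zip_to_county.items() if c in holdout_counties]
    return train, holdout


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Splitting by county rather than by zip code keeps neighboring zip codes in the same county from appearing on both sides of the split, which would otherwise make the spatial prediction task easier than it appears.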

Paper Structure

This paper contains 12 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: First row shows the values of "feature1" in CONUS and Cook County, IL. The second row shows "feature8" in the same locations.
  • Figure 2: Map showing the dataset split. Yellow areas are counties where all contained zip codes are in the holdout set, green areas are used for training and validation.
  • Figure 3: Actual, predicted, and test set scatter plot of 6 variables. The "predicted" column shows a concatenation of predictions for the 5 validation sets and the test set, made by six different models. The scatter plot shows the test set performance. Training set predictions are not displayed.
  • Figure 4: Performance vs. percentage of training data
  • Figure 5: Performance vs. feature dimension
  • ...and 3 more figures