A Bayesian Spatial Model to Correct Under-Reporting in Urban Crowdsourcing

Gabriel Agostini, Emma Pierson, Nikhil Garg

TL;DR

This work tackles under-reporting in urban crowdsourcing with a Bayesian spatial latent-variable model that infers true event occurrence probabilities $\Pr(A_i=1)$ from positive-unlabeled report data $T_i$. The model captures demographic heterogeneity through a reporting rate $\psi_i$ and spatial dependence through an Ising prior with parameters $\theta_0$ and $\theta_1$. By exploiting spatial correlation, it separates events that occurred but went unreported from events that did not occur. It supports both homogeneous and heterogeneous reporting specifications, with inference via Gibbs sampling and the single-variable exchange algorithm (SVEA) to handle the intractable normalizing constant. The authors validate the approach with semi-synthetic experiments and apply it to NYC 311 flood reports after Hurricane Ida, showing improved prediction of future reports and more equity-aware inspection allocations; pooling results across storms reveals demographic patterns in reporting behavior. The framework supports proactive governance by enabling faster, fairer resource allocation, and it may extend to other spatially correlated, under-reported urban phenomena beyond flooding.
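To make the summary above concrete, here is a minimal sketch of one Gibbs sampler over the latent states $A_i$, holding $\theta_0$, $\theta_1$, and $\psi_i$ fixed. It is not the authors' implementation: it assumes a $\{0,1\}$ encoding of $A_i$, a prior proportional to $\exp(\theta_0 \sum_i A_i + \theta_1 \sum_{i \sim j} A_i A_j)$, and a report model in which $T_i = 1$ only if $A_i = 1$, with probability $\psi_i$. The function name, the `neighbors` adjacency format, and the sweep counts are hypothetical, and the sampling of $\theta$ (where SVEA would enter) is omitted.

```python
import numpy as np

def gibbs_latent_events(T, neighbors, psi, theta0, theta1,
                        n_sweeps=300, burn_in=100, seed=0):
    """Estimate Pr(A_i = 1 | T) by Gibbs sweeps with theta and psi held fixed.

    Assumptions (illustrative, not taken from the paper's code):
    A_i in {0, 1}; neighbors[i] lists the indices adjacent to tract i.
    """
    rng = np.random.default_rng(seed)
    T = np.asarray(T, dtype=int)
    psi = np.asarray(psi, dtype=float)
    A = T.copy()                      # a report implies the event occurred
    totals = np.zeros(len(T))
    for sweep in range(n_sweeps):
        for i in range(len(T)):
            if T[i] == 1:
                continue              # A_i = 1 with certainty when reported
            field = theta0 + theta1 * sum(A[j] for j in neighbors[i])
            # Unnormalized weight of (A_i = 1, no report); weight of A_i = 0 is 1.
            w1 = np.exp(field) * (1.0 - psi[i])
            A[i] = int(rng.random() < w1 / (w1 + 1.0))
        if sweep >= burn_in:
            totals += A
    return totals / (n_sweeps - burn_in)

# Five tracts on a line; only the two end tracts filed a report.
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]
probs = gibbs_latent_events([1, 0, 0, 0, 1], neighbors,
                            psi=[0.5] * 5, theta0=-1.0, theta1=1.0)
```

Under these assumptions, a tract without a report keeps a nonzero event probability whenever $\psi_i < 1$, and that probability rises with the number of neighbors currently inferred as positive, which is how spatial correlation compensates for missing reports.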

Abstract

Decision-makers often observe the occurrence of events through a reporting process. City governments, for example, rely on resident reports to find and then resolve urban infrastructural problems such as fallen street trees, flooded basements, or rat infestations. Without additional assumptions, there is no way to distinguish events that occur but are not reported from events that truly did not occur--a fundamental problem in settings with positive-unlabeled data. Because disparities in reporting rates correlate with resident demographics, addressing incidents only on the basis of reports leads to systematic neglect in neighborhoods that are less likely to report events. We show how to overcome this challenge by leveraging the fact that events are spatially correlated. Our framework uses a Bayesian spatial latent variable model to infer event occurrence probabilities and applies it to storm-induced flooding reports in New York City, further pooling results across multiple storms. We show that a model accounting for under-reporting and spatial correlation predicts future reports more accurately than other models, and further induces a more equitable set of inspections: its allocations better reflect the population and provide equitable service to non-white, less traditionally educated, and lower-income residents. This finding reflects heterogeneous reporting behavior learned by the model: reporting rates are higher in Census tracts with higher populations, proportions of white residents, and proportions of owner-occupied households. Our work lays the groundwork for more equitable proactive government services, even with disparate reporting behavior.

Paper Structure (37 sections, 10 equations, 18 figures, 5 tables)

Figures (18)

  • Figure 1: Model-inferred probabilities $\Pr(A_i=1)$ that each New York City Census tract was flooded after Hurricane Ida, from the heterogeneous reporting model. Hatched lines indicate tracts that reported during the training period.
  • Figure 2: Demographic disparities when allocating resources to 100 census tracts (among those that do not report), using inferred flood probabilities from the four models. The horizontal axes show the proportion of all residents served by the inspections (i.e., those who reside in the $100$ inspected census tracts) who are non-white and do not have a high school degree, computed as a weighted average of the proportions in inspected tracts. Dashed lines represent the total proportion of residents in tracts without a report who are non-white and do not have a high school degree.
  • Figure 3: Model-inferred report rates $\psi_i$ per census tract, from the heterogeneous reporting model. The report rates range from near $0.1$ to $0.9$. Weighted averages of report rates per racial composition are shown in a bar plot.
  • Figure 4: Estimated multivariate coefficients after pooling the three storms. All features were standardized. Confidence intervals are shown, and estimates whose association is not significantly different from zero are colored in grey.
  • Figure 5: Estimated coefficients for the model trained in each storm individually. For the regression component, all features were standardized to have zero mean and unit variance. Confidence intervals are shown, and estimates with insignificant positive or negative associations are colored in grey.
  • ...and 13 more figures
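
The allocation metric described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it assumes the weighted average is weighted by tract population, and the function name `served_share` and its array arguments are hypothetical.

```python
import numpy as np

def served_share(flood_prob, reported, population, group_share, k=100):
    """Share of a demographic group among residents of the k inspected tracts.

    Inspections go to the k non-reporting tracts with the highest inferred
    flood probability. Returns (served, baseline), where baseline is the
    group's share among all residents of non-reporting tracts.
    Population weighting is an assumption made for this sketch.
    """
    flood_prob = np.asarray(flood_prob, dtype=float)
    reported = np.asarray(reported, dtype=bool)
    population = np.asarray(population, dtype=float)
    group_share = np.asarray(group_share, dtype=float)

    candidates = np.flatnonzero(~reported)          # tracts without a report
    ranked = candidates[np.argsort(-flood_prob[candidates], kind="stable")]
    chosen = ranked[:k]

    served = np.average(group_share[chosen], weights=population[chosen])
    baseline = np.average(group_share[candidates], weights=population[candidates])
    return served, baseline

# Toy example with four tracts and two inspections.
served, baseline = served_share(
    flood_prob=[0.9, 0.8, 0.1, 0.7],
    reported=[True, False, False, False],
    population=[100, 100, 300, 300],
    group_share=[0.5, 0.2, 0.9, 0.6],
    k=2,
)
```

Comparing `served` against `baseline` mirrors the comparison between each model's value and the dashed line in the figure: an allocation whose served share sits below the baseline under-serves that group relative to its presence in non-reporting tracts.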