
Bayesian nonparametric boundary detection for multiple areal data

Matteo Gianella, Mario Beraha, Alessandra Guglielmi

Abstract

We consider the problem of boundary detection for areal data, focusing on situations where multiple observations are available for each areal unit. We propose a Bayesian nonparametric mixture model for the area-specific population densities, with spatially dependent weights and a random number of components. Contrary to previously proposed methods for boundary detection, which consider one observation per areal unit, ours does not require external information such as area-specific covariates or dissimilarity metrics. Instead, by exploiting information from multiple samples per area, it is able to identify boundaries between areas that exhibit different densities. Crucially, the number of mixture components needs to be learned from data to obtain meaningful boundary detection, due to the non-identifiability of overfitted mixtures. Therefore, we treat it as random and place a prior on it. The motivating application is the analysis of economic inequality in the greater Los Angeles region, which typically yields social inequality and unrest. Efficient posterior computation is facilitated by a transdimensional Markov chain Monte Carlo sampler which exploits the recently introduced optimal auxiliary priors to improve the mixing. The methodology is validated via extensive simulations and applied to the income data in the greater Los Angeles region. We identify several boundaries in the income distributions, which can be explained ex post in terms of the percentage of the population without health insurance, though not in terms of the total number of crimes, showing the usefulness of such an analysis to policymakers.
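
The non-identifiability of overfitted mixtures mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's model: the kernel means, the weight vectors `w1` and `w2`, and the helper names are all assumptions chosen for illustration. It builds a three-component Gaussian mixture in which two components share the same kernel, so that two clearly different weight vectors produce the same density. A boundary rule that compared weights directly would then report a difference between two areas whose densities coincide.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of a univariate Gaussian evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def mixture_pdf(x, weights, means, sd=1.0):
    """Density of a Gaussian mixture with a common standard deviation."""
    x = np.asarray(x, dtype=float)
    # One row per component, one column per evaluation point.
    components = np.stack([normal_pdf(x, m, sd) for m in means])
    return np.asarray(weights, dtype=float) @ components


# Hypothetical overfitted mixture: the second and third kernels are identical,
# so only the sum of their weights affects the density.
means = [-2.0, 2.0, 2.0]
w1 = np.array([0.5, 0.5, 0.0])
w2 = np.array([0.5, 0.1, 0.4])

grid = np.linspace(-6.0, 6.0, 241)
f1 = mixture_pdf(grid, w1, means)
f2 = mixture_pdf(grid, w2, means)

# Largest pointwise gap between the two densities (zero up to rounding).
max_density_gap = float(np.max(np.abs(f1 - f2)))
# L1 distance between the two weight vectors (clearly non-zero).
weight_gap = float(np.sum(np.abs(w1 - w2)))

print(f"max density gap: {max_density_gap:.2e}")
print(f"weight gap:      {weight_gap:.2f}")
```

In this toy setting the weights differ substantially while the densities agree, which is one way to see why the abstract stresses learning the number of components rather than fixing it at an overly large value.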

Paper Structure (32 sections, 28 equations, 28 figures, 11 tables, 1 algorithm)


Figures (28)

  • Figure 1: California census income data in the log scale. Each area is coloured according to the empirical mean (left) and variance (right) of the log-income.
  • Figure 2: Example of non-identifiability with overfitted mixtures. The black dashed line is $f_0$. The blue and orange lines are the mixtures $\widetilde{f}(\bm{w}_1)$ and $\widetilde{f}(\bm{w}_2)$ of six Gaussian kernels.
  • Figure 3: Simulation from spatially dependent weights: (a) and (b) show the values of $w_{i,1}$ and $w_{i,2}$ for each area; (c) represents the adjacency graph, where orange squares denote pairs of geographically contiguous areas.
  • Figure 4: Posterior inference on the simulated dataset from spatially dependent weights under default parameters: (a) Posterior distribution of $H$; (b) Traceplot of $H$; (c)-(d) Comparison between the true (blue line) and estimated (orange line) densities in two areas. The orange ribbon represents the $95\%$ credible band for the estimated densities.
  • Figure 5: Posterior inference for the simulated dataset of \ref{main:subsec:bd_simstudy}: (a) shows the spatial grid labelled according to the true data generating densities with the estimated boundary edges highlighted in red on the map; (b) and (c) report posterior estimated densities in two boundary areas, namely area 3 and area 4. The orange band represents the 95% credible interval for the estimated density.
  • ...and 23 more figures