Urban Boundary Delineation from Commuting Data with Bayesian Stochastic Blockmodeling: Scale, Contiguity, and Hierarchy

Sebastian Morel-Balbi, Alec Kirkley

TL;DR

The paper analyzes urban boundary delineation from commuting data using stochastic block models (SBMs) and the minimum description length (MDL) principle to achieve principled network partitioning without tunable parameters. It compares microcanonical SBM variants across directed, weighted, and multigraph representations, and introduces a fast greedy agglomerative regionalization to enforce spatial contiguity while preserving compression. Results show weighted SBMs, especially nested variants, yield strong data compression across scales, but standard SBMs often produce discontiguous regions; the greedy method delivers contiguous partitions with comparable MDL performance. At tract and county levels, the approach reveals scale-dependent trade-offs between contiguity, interpretability, and compression, with weighted models generally outperforming multigraphs and counties capturing substantial but not optimal structure. The work provides practical guidelines for selecting network representations and SBM variants and demonstrates a flexible, scalable tool for data-driven urban boundary delineation with broad applicability to mobility networks.

Abstract

A common method for delineating urban and suburban boundaries is to identify clusters of spatial units that are highly interconnected in a network of commuting flows, each cluster signaling a cohesive economic submarket. It is critical that the clustering methods employed for this task are principled and free of unnecessary tunable parameters, so that they avoid unwanted inductive biases while remaining scalable for high-resolution mobility networks. Here we systematically assess the benefits and limitations of a wide array of Stochastic Block Models (SBMs), a family of principled, nonparametric models for identifying clusters in networks, for delineating urban spatial boundaries with commuting data. We find that the data compression capability and relative performance of different SBM variants depend heavily on the spatial extent of the commuting network, its aggregation scale, and the method used for weighting network edges. We also construct a new measure to assess the degree to which community detection algorithms find spatially contiguous partitions, finding that traditional SBMs may produce substantial spatial discontiguities that make them challenging to use in general for urban boundary delineation. We propose a fast nonparametric regionalization algorithm that can alleviate this issue: it achieves data compression close to that of unconstrained SBMs while ensuring spatial contiguity, benefits from a deterministic optimization procedure, and generalizes to a wide range of community detection objective functions.
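To make the idea of a contiguity-preserving greedy agglomeration concrete, here is a minimal sketch. It is not the authors' implementation: the objective is a stand-in (negative directed modularity) rather than the SBM description length used in the paper, the stopping rule (halt when no merge lowers the cost) is an assumption, and the names `modularity_cost` and `greedy_regionalize` are hypothetical. Only clusters that share a spatial border are allowed to merge, so every cluster stays contiguous by construction, and ties are broken by sorted order so the procedure is deterministic.

```python
def modularity_cost(flows, labels):
    """Stand-in objective (assumption): negative directed modularity; lower is better."""
    total = sum(flows.values())
    inside, out_s, in_s = {}, {}, {}
    for (u, v), w in flows.items():
        cu, cv = labels[u], labels[v]
        out_s[cu] = out_s.get(cu, 0) + w
        in_s[cv] = in_s.get(cv, 0) + w
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + w
    q = 0.0
    for c in set(labels.values()):
        q += inside.get(c, 0) / total - out_s.get(c, 0) * in_s.get(c, 0) / total**2
    return -q


def greedy_regionalize(nodes, adjacency, flows, cost=modularity_cost):
    """Greedily merge spatially adjacent clusters while the cost keeps decreasing."""
    labels = {u: u for u in nodes}  # start with every spatial unit in its own cluster
    best = cost(flows, labels)
    while True:
        # Candidate merges: pairs of distinct clusters that share a spatial border.
        pairs = sorted({tuple(sorted((labels[u], labels[v])))
                        for u, v in adjacency if labels[u] != labels[v]})
        best_pair, best_cost = None, best
        for a, b in pairs:
            trial = {u: (a if c == b else c) for u, c in labels.items()}
            trial_cost = cost(flows, trial)
            if trial_cost < best_cost:  # strict, so the first pair wins ties
                best_pair, best_cost = (a, b), trial_cost
        if best_pair is None:  # no merge improves the objective: stop
            return labels, best
        a, b = best_pair
        labels = {u: (a if c == b else c) for u, c in labels.items()}
        best = best_cost


# Toy example: six units on a line, two tightly linked triples, one weak bridge.
adjacency = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
flows = {}
for group in ([0, 1, 2], [3, 4, 5]):
    for u in group:
        for v in group:
            if u != v:
                flows[(u, v)] = 10
flows[(2, 3)] = flows[(3, 2)] = 1
labels, final_cost = greedy_regionalize(range(6), adjacency, flows)
print(labels)
```

Because the cost function is passed in as an argument, a description-length objective could be substituted without changing the merge loop. The full recomputation of the cost for every trial merge keeps the sketch short but would be too slow for tract-level networks.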

Paper Structure (15 sections, 22 equations, 9 figures)

This paper contains 15 sections, 22 equations, 9 figures.

Figures (9)

  • Figure 1: Illustration of the methodologies described in the text. (a) The original problem formulation: given a network of commute flows in a metropolitan area, infer a regionalization that captures cohesive commute patterns. (b) Results obtained by minimizing the description length of the weighted SBM (Eq. \ref{eq:Wdl}) with simulated annealing. (c) Results obtained by minimizing the same description length with the proposed greedy regionalization algorithm described in Sec. \ref{sec:greedy}. (d) Illustration of the optimization trajectory of simulated annealing. (e) Illustration of the optimization trajectory of the greedy agglomerative method, with the description length obtained by MCMC indicated by the dashed line.
  • Figure 2: Difference in description length between the best-performing multigraph model and the best-performing weighted model across all metropolitan areas as a function of $N$. Positive values of the difference indicate that weighted variants tend to outperform multigraph ones across all metropolitan areas, and the difference in performance grows larger as the number of nodes in the network increases.
  • Figure 3: (a) Compression ratio for the best-performing model as a function of $N$ across all metropolitan areas considered. The dashed horizontal line corresponds to the performance of the null Erdős-Rényi-style model (Eq. \ref{eq:er_dl}). As the number of nodes $N$ increases, SBMs achieve increasing compression with respect to the null model. (b) Histogram of the best-performing model across the statistical classifications of the metropolitan areas. While there is more diversity among the best-performing models for CBSAs, as we move from CSAs to States the weighted nested variant of the SBM becomes dominant due to the increasing size $N$. (c) Description length ratio between the regionalizations obtained via the greedy agglomerative method and the best-performing SBM. While SBMs always outperform the greedy method in terms of compression, the differences are quite modest even for large networks. (d) Number of groups inferred by the models as a function of $N$, with regression lines drawn as a visual guide. Nested models tend to identify more groups than non-nested models. (e) Comparison of the number of groups inferred by the greedy agglomerative method and the standard wSBM, with the black dashed line indicating the case $B_{GA} = B_{wSBM}$. Both methods infer a similar number of groups, with the greedy method consistently finding slightly fewer. (f) Adjusted mutual information (AMI) between the partitions inferred with the best-performing SBM and the greedy algorithm across all metropolitan areas considered, as a function of the difference in description length (normalized by network size). The sizes of the points are proportional to $N$.
  • Figure 4: (a) Sample regionalizations inferred via the wSBM for Ocean City, NJ, and Kalamazoo-Portage, MI. In both cases, metropolitan areas are divided into $B = 5$ groups, $B^* = 4$ of which are discontiguous. (b) The fraction of discontiguous groups as a function of the number of nodes $N$ across all the metropolitan areas. (c) The average number of components comprising each group as a function of $N$ across all the metropolitan areas. (d) The average component size as a function of the average number of components making up each group across all metropolitan areas. (e) The contiguity violation measure (CVM) (Eq. \ref{eq:cvm}) as a function of the fraction of discontiguous groups across all the metropolitan areas. The sizes of the points are proportional to $N$ in panels (d) and (e).
  • Figure 5: (a) The spatial normalized mutual information (sNMI) score (Eq. \ref{eq:snmi}) between the partitions inferred via the greedy agglomerative algorithm and those induced by the administrative county subdivisions, for all of the lower 48 U.S. states, as a function of the difference in description length $\Sigma^{(GA)} - \Sigma^{(county)}$ normalized by the number of nodes $N$. The sizes of the points are proportional to $N$. For all states, the agglomerative algorithm was stopped when the number of inferred groups $B_{GA}$ was equal to the number of counties in each state, $B_{GA} = B_{county}$, for a more direct comparison of their partitions. Red points indicate states where the county subdivision has a lower description length than the one inferred by the greedy algorithm, due to the artificial constraint $B_{GA} = B_{county}$, which may lie far from the true optimal $B$. (b) Boxplots of the distributions of sNMI scores between the partitions inferred by the greedy algorithm and $100$ random Voronoi tessellations with a number of seeds equal to $B_{GA}=B_{county}$. The original sNMI scores between the greedy solution and the administrative county subdivisions are indicated by a star; blue stars indicate county partitions whose sNMI scores with the corresponding greedy solution lie above the 95th percentile of the boxplot distributions, and red stars indicate solutions that lie below. (c) Distribution of sNMI scores between the partition inferred via the greedy algorithm (GA) and $100$ random Voronoi tessellations for the state of New York. The vertical dashed red line indicates the sNMI score between the original administrative county subdivision and the GA partition. (d) Regionalization results for the state of New York inferred via the greedy algorithm. (e) Administrative county subdivision of New York state. (f) A sample Voronoi tessellation of the state of New York. In all cases, red circles indicate the number of regions in the New York City area. As can be observed, the greedy algorithm uses the majority of the inferred groups to describe this region, as would be expected based on its high level of commuting heterogeneity and high tract density.
  • ...and 4 more figures
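
The simpler contiguity diagnostics reported in Figure 4 (the number of discontiguous groups $B^*$, the fraction of discontiguous groups, and the average number of components per group) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names and the input format (a node-to-group dictionary plus a list of spatially adjacent unit pairs) are assumptions, and the CVM itself is not computed because its definition is not reproduced in this summary.

```python
from collections import defaultdict


def group_components(labels, adjacency):
    """Count the spatially connected components of each group.

    labels:    dict mapping spatial unit -> group label
    adjacency: iterable of (u, v) pairs of units that share a border
    """
    neighbors = defaultdict(set)
    for u, v in adjacency:
        neighbors[u].add(v)
        neighbors[v].add(u)

    members = defaultdict(set)
    for u, g in labels.items():
        members[g].add(u)

    counts = {}
    for g, nodes in members.items():
        unseen = set(nodes)
        n_components = 0
        while unseen:
            # Depth-first search restricted to units of the same group.
            stack = [unseen.pop()]
            n_components += 1
            while stack:
                u = stack.pop()
                for v in neighbors[u] & unseen:
                    unseen.discard(v)
                    stack.append(v)
        counts[g] = n_components
    return counts


def contiguity_summary(labels, adjacency):
    """Summarize how far a partition is from being spatially contiguous."""
    counts = group_components(labels, adjacency)
    n_groups = len(counts)
    n_discontiguous = sum(1 for c in counts.values() if c > 1)
    return {
        "B": n_groups,
        "B_star": n_discontiguous,
        "frac_discontiguous": n_discontiguous / n_groups,
        "mean_components": sum(counts.values()) / n_groups,
    }


# Five units on a line; group "a" is split in two by unit 1.
adjacency = [(0, 1), (1, 2), (2, 3), (3, 4)]
labels = {0: "a", 1: "b", 2: "a", 3: "c", 4: "c"}
print(contiguity_summary(labels, adjacency))
```

A fully contiguous partition gives `B_star == 0` and `mean_components == 1.0`, so these two numbers offer a quick check on any partition before it is used as a set of boundaries.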