Population synthesis with geographic coordinates

Jacopo Lenti, Lorenzo Costantini, Ariadna Fosch, Anna Monticelli, David Scala, Marco Pangallo

TL;DR

This work addresses the challenge of generating synthetic populations geolocated with explicit coordinates while preserving data privacy. It introduces an NF+VAE generative framework that first maps geographic coordinates into a latent space with Normalizing Flows and then learns a joint representation of spatial and non-spatial features with a Variational Autoencoder to produce synthetic populations. The authors propose an evaluation protocol covering fidelity, utility, and privacy, and show that NF+VAE outperforms baselines on large, real-world datasets, including mortgage data and Airbnb listings. The approach provides fine-grained, privacy-preserving geolocated data for agent-based models (ABMs) of flood risk, epidemic spread, evacuation planning, and transport, offering a scalable and adaptable alternative to coarse geographic aggregates.

Abstract

It is increasingly important to generate synthetic populations with explicit coordinates rather than coarse geographic areas, yet no established methods exist to achieve this. One reason is that latitude and longitude differ from other continuous variables, exhibiting large empty spaces and highly uneven densities. To address this, we propose a population synthesis algorithm that first maps spatial coordinates into a more regular latent space using Normalizing Flows (NF), and then combines them with other features in a Variational Autoencoder (VAE) to generate synthetic populations. This approach also learns the joint distribution of spatial and non-spatial features, exploiting spatial autocorrelations. We demonstrate the method by generating synthetic homes with the same statistical properties as real homes in 121 datasets, corresponding to diverse geographies. We further propose an evaluation framework that measures both spatial accuracy and practical utility, while ensuring privacy preservation. Our results show that the NF+VAE architecture outperforms popular benchmarks, including copula-based methods and uniform allocation within geographic areas. The ability to generate geolocated synthetic populations at fine spatial resolution opens the door to applications requiring detailed geography, from household responses to floods, to epidemic spread, evacuation planning, and transport modeling.
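
One of the benchmarks named in the abstract is uniform allocation within geographic areas. As a rough illustration of that idea, and not the authors' implementation, the Python sketch below places each home at a point drawn uniformly inside its area. It assumes, purely for simplicity, that every area is an axis-aligned bounding box; real administrative areas are polygons. The function name `allocate_uniformly` and the area labels are hypothetical.

```python
import random


def allocate_uniformly(area_counts, area_bounds, seed=0):
    """Assign each home a coordinate drawn uniformly inside its area.

    area_counts: dict mapping area label -> number of homes in that area
    area_bounds: dict mapping area label -> (lon_min, lat_min, lon_max, lat_max)
                 (assumption: areas are rectangles, not real polygons)
    Returns a list of (area, lon, lat) tuples.
    """
    rng = random.Random(seed)
    points = []
    for area, n_homes in area_counts.items():
        lon_min, lat_min, lon_max, lat_max = area_bounds[area]
        for _ in range(n_homes):
            lon = rng.uniform(lon_min, lon_max)
            lat = rng.uniform(lat_min, lat_max)
            points.append((area, lon, lat))
    return points


# Hypothetical toy input: two rectangular areas with different home counts.
counts = {"area_A": 3, "area_B": 2}
bounds = {
    "area_A": (7.60, 45.00, 7.70, 45.10),
    "area_B": (7.70, 45.00, 7.80, 45.10),
}
synthetic_points = allocate_uniformly(counts, bounds)
```

Because points are spread evenly inside each area, this kind of baseline cannot reproduce empty spaces or density variations below the resolution of the areas, which is the limitation the abstract attributes to coarse geographic units.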

Paper Structure

This paper contains 9 sections, 7 equations, 4 figures.

Figures (4)

  • Figure 1: Overview of the proposed population synthesis approach. The real geolocated population is given as input to the generator. Normalizing Flows are trained to map the real geographic coordinates to a simple latent space. Together with all other home features, these latent coordinates are used to train a Variational Autoencoder. Finally, the Variational Autoencoder samples synthetic populations that resemble the input data. The left and right panels compare random samples of 1,000 real and 1,000 synthetic homes, respectively, in the province of Turin (gray lines). The synthetic data reproduce the real patterns: garages are more common in the outskirts of the city of Turin (black lines in the maps) and less common in the city center.
  • Figure 2: Description of the evaluation framework, based on fidelity, utility, and privacy. Fidelity measures (i) the similarity of the distribution of geographic coordinates, (ii) the similarity of the spatial autocorrelations, and (iii) the similarity of the houses generated in each grid cell. Utility assesses how well a model trained on synthetic data predicts house prices in real data. Privacy measures the robustness against membership inference attacks.
  • Figure 3: Real and synthetic homes generated by the benchmark generators in the province of Turin. Each map shows a random sample of 1,000 homes, colored by the presence of a garage.
  • Figure 4: Distributions of evaluation metrics in data_isp. (a) Fidelity - Geographic coordinates: sliced-Wasserstein distance between real and synthetic geographic coordinates; (b) Similarity - Spatial autocorrelation: distance between the spatial autocorrelations of the PCs of real and synthetic homes; (c) Fidelity - Local features: distance between the average home in each spatial grid cell; (d) Utility: difference in $R^2$ when predicting real log-price with models trained on real and on synthetic data; (e) Privacy: difference in the AUC-ROC of a classifier trained to infer membership in the original dataset. Best performance is close to 0 for all metrics. Metrics (a), (b), (c), and (d) are always positive; (e) can be negative. Detailed statistics for this figure are available in the SM.
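
Panel (a) of Figure 4 uses a sliced-Wasserstein distance between real and synthetic coordinates. The Python sketch below shows one common way such a distance can be estimated; it is not taken from the paper. It assumes equally sized samples, an order-1 Wasserstein distance, random projection directions, and coordinates treated as planar. The function name `sliced_wasserstein_2d` and the toy points are hypothetical.

```python
import math
import random


def sliced_wasserstein_2d(real, synth, n_projections=200, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance between
    two equally sized sets of 2D points, given as lists of (x, y) tuples."""
    if len(real) != len(synth):
        raise ValueError("this sketch assumes equally sized samples")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction on the half circle.
        theta = rng.uniform(0.0, math.pi)
        dx, dy = math.cos(theta), math.sin(theta)
        # Project both point sets onto that direction and sort them.
        proj_real = sorted(x * dx + y * dy for x, y in real)
        proj_synth = sorted(x * dx + y * dy for x, y in synth)
        # For equal-size 1D samples, W1 is the mean gap between order statistics.
        gaps = (abs(a - b) for a, b in zip(proj_real, proj_synth))
        total += sum(gaps) / len(real)
    return total / n_projections


# Hypothetical toy data: the "synthetic" set is the real set shifted by 1 in x.
real_points = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0), (2.0, 4.0)]
shifted_points = [(x + 1.0, y) for x, y in real_points]

same_distance = sliced_wasserstein_2d(real_points, real_points)
shift_distance = sliced_wasserstein_2d(real_points, shifted_points)
```

Identical point sets give a distance of exactly 0, which matches the caption's statement that the best performance is close to 0. Each one-dimensional projection only requires a sort, which keeps the estimate cheap for large samples.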