Population synthesis with geographic coordinates
Jacopo Lenti, Lorenzo Costantini, Ariadna Fosch, Anna Monticelli, David Scala, Marco Pangallo
TL;DR
This work addresses the challenge of generating synthetic populations with explicit geographic coordinates while preserving data privacy. It introduces an NF+VAE generative framework that first maps geographic coordinates into a latent space with Normalizing Flows (NF) and then learns a joint representation of spatial and non-spatial features with a Variational Autoencoder (VAE) to produce synthetic populations. The authors propose an evaluation protocol covering fidelity, utility, and privacy, and report that NF+VAE outperforms baselines on large, real-world datasets, including mortgage data and Airbnb listings. The approach enables fine-grained, privacy-preserving geolocated data for agent-based models (ABMs) of flood risk, epidemic spread, evacuation planning, and transport, offering a scalable and adaptable alternative to coarse geographic aggregates.
Abstract
It is increasingly important to generate synthetic populations with explicit coordinates rather than coarse geographic areas, yet no established methods exist to achieve this. One reason is that latitude and longitude differ from other continuous variables, exhibiting large empty spaces and highly uneven densities. To address this, we propose a population synthesis algorithm that first maps spatial coordinates into a more regular latent space using Normalizing Flows (NF), and then combines them with other features in a Variational Autoencoder (VAE) to generate synthetic populations. This approach also learns the joint distribution between spatial and non-spatial features, exploiting spatial autocorrelations. We demonstrate the method by generating synthetic homes with the same statistical properties as real homes in 121 datasets, corresponding to diverse geographies. We further propose an evaluation framework that measures both spatial accuracy and practical utility, while ensuring privacy preservation. Our results show that the NF+VAE architecture outperforms popular benchmarks, including copula-based methods and uniform allocation within geographic areas. The ability to generate geolocated synthetic populations at fine spatial resolution opens the door to applications requiring detailed geography, from household responses to floods to epidemic spread, evacuation planning, and transport modeling.
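To make the "uniform allocation within geographic areas" benchmark concrete, the following is a minimal sketch of what such a baseline might look like. It is not the authors' code. It assumes, purely for illustration, that each area is an axis-aligned bounding box (real areas would typically be polygons), and the names `AREAS` and `allocate_uniform`, the coordinates, and the prices are all hypothetical.

```python
import random

# Hypothetical areas, simplified to axis-aligned bounding boxes:
# (min_lon, min_lat, max_lon, max_lat). Real areas would be polygons.
AREAS = {
    "A": (12.40, 41.80, 12.50, 41.90),
    "B": (12.50, 41.80, 12.60, 41.90),
}

def allocate_uniform(records, areas, seed=0):
    """Attach a uniformly sampled (lon, lat) inside each record's area box."""
    rng = random.Random(seed)  # local generator so results are reproducible
    out = []
    for rec in records:
        min_lon, min_lat, max_lon, max_lat = areas[rec["area"]]
        out.append({
            **rec,  # keep the non-spatial features unchanged
            "lon": rng.uniform(min_lon, max_lon),
            "lat": rng.uniform(min_lat, max_lat),
        })
    return out

# Made-up records that are only located at the area level.
records = [
    {"area": "A", "price": 250_000},
    {"area": "B", "price": 310_000},
]
synthetic = allocate_uniform(records, AREAS)
```

Because every point is drawn evenly across its area, a baseline of this kind cannot reproduce empty spaces or uneven densities inside an area, and the sampled position carries no information about the non-spatial features beyond the area label.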
