Generating Spatial Synthetic Populations Using Wasserstein Generative Adversarial Network: A Case Study with EU-SILC Data for Helsinki and Thessaloniki

Vanja Falck

TL;DR

This study addresses the challenge of creating feature-rich spatial synthetic populations for agent-based simulations by training a Wasserstein Generative Adversarial Network (WGAN) on EU-SILC microdata from Finland and Greece to model Helsinki and Thessaloniki. It compares two balancing strategies, weight-imputation and WGAN-imputation, using several validation metrics ($SRMSE$, Pearson's $r$, $R^2$, and Bland-Altman plots) to assess statistical, internal, and external validity. The results show that WGAN-based balancing can closely match targeted demographic profiles (e.g., for Helsinki) but may distort fringe groups, particularly for the self-perceived health variable $PH010$, which highlights discrimination risks and the need for careful balancing and validation. The findings underscore the potential of WGANs for producing rich synthetic populations while drawing attention to ethical and methodological challenges in representing vulnerable groups, and they suggest future work on robust validity frameworks and advanced generative techniques. Overall, this work advances privacy-preserving, high-dimensional synthetic population generation for urban microsimulation, with practical implications for urban planning, health, and economic forecasting.
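The three numeric metrics named above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the SRMSE form used here (RMSE of category shares divided by the mean share) is one common definition from the synthetic-population literature and is an assumption, as are the function names and the example counts.

```python
import numpy as np

def srmse(orig_counts, synth_counts):
    """Standardized RMSE between two category distributions.

    Assumed definition: RMSE of the relative frequencies divided by
    the mean relative frequency of the original data.
    """
    p = np.asarray(orig_counts, dtype=float)
    q = np.asarray(synth_counts, dtype=float)
    p = p / p.sum()  # convert counts to shares
    q = q / q.sum()
    rmse = np.sqrt(np.mean((p - q) ** 2))
    return float(rmse / p.mean())

def pearson_r(x, y):
    """Pearson correlation between two equally long vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def r_squared(observed, predicted):
    """Coefficient of determination of `predicted` against `observed`."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

if __name__ == "__main__":
    # Hypothetical counts for one categorical variable with five levels.
    original = [120, 340, 310, 180, 50]
    synthetic = [130, 335, 320, 175, 40]
    print("SRMSE:", srmse(original, synthetic))
    print("Pearson r:", pearson_r(original, synthetic))
    print("R^2:", r_squared(original, synthetic))
```

Lower SRMSE and values of $r$ and $R^2$ closer to 1 indicate a closer match between the original and synthetic marginal distributions; none of these summary values reveals which individual categories are distorted.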

Abstract

Agent-based social simulations can enhance our understanding of urban planning, public health, and economic forecasting. Realistic synthetic populations with numerous attributes strengthen these simulations. The Wasserstein Generative Adversarial Network, trained on microdata such as EU-SILC, can create robust synthetic populations. Such methods, aided by external statistics or EU-SILC weights, generate spatial synthetic populations for agent-based models. Increased access to high-quality micro-data has sparked interest in synthetic populations, which preserve demographic profiles and analytical strength while ensuring privacy and preventing discrimination. This study uses national data from Finland and Greece for Helsinki and Thessaloniki to explore balanced spatial synthetic population generation. Results show challenges related to balancing data with or without aggregated statistics for the target population, as well as a general under-representation of fringe profiles by deep generative methods. The latter can lead to discrimination in agent-based simulations.

Paper Structure

This paper contains 16 sections and 8 figures.

Figures (8)

  • Figure 1: Match between single variables in original data and synthetic data from the Wasserstein generative adversarial network. The comparison uses variables from EU-SILC Finland in 2022. Figures a) and b) are produced by training on weight-balanced complete population data. Figure c) is produced by training on the weight-balanced region that includes Helsinki. Figure d) is produced by training on the WGAN-imputed region that includes Helsinki. Figures e) and f) are produced by training on weight-balanced complete population data. Figure g) is produced by training on the weight-balanced region including Thessaloniki only.
  • Figure 2: Figure (a) compares self-perceived health (PH010) in synthetic data trained on weight-imputed originals for Finland with weight-imputed original data on the most populated demographic keys. Figure (b) compares synthetic data trained on WGAN-imputed originals with weight-imputed original data. Figure (c) compares synthetic data trained on WGAN-imputed originals with WGAN-imputed original data.
  • Figure 3: Reproduction of self-perceived health in synthetic populations trained on weight- and WGAN-imputed original data, compared to their respective training data.
  • Figure 4: Comparison of aggregated statistics on gender, education and age to the final full-scale synthetic population of Helsinki created from the population of Finland and balanced by WGAN synthetic data (Figure a). The fit between aggregated statistics and synthetic population trained on weight-imputed original data is shown in Figure b.
  • Figure 5: Bland-Altman plots. a) Plot for synthetic data trained on weight-imputed originals for Helsinki municipality. b) Plot for synthetic data trained on WGAN-imputed originals for Helsinki. c) Plot for the region to which Thessaloniki municipality belongs. The points outside the two confidence interval lines are variable-value combinations that would, analytically, measure as significantly different from the original data.
  • ...and 3 more figures
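
The Bland-Altman comparison described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: it uses the standard limits of agreement (mean difference ± 1.96 standard deviations) as the two boundary lines, and the function name and the example shares are hypothetical.

```python
import numpy as np

def bland_altman(orig, synth, z=1.96):
    """Return Bland-Altman quantities for paired original/synthetic values.

    Each pair is assumed to be the share of one variable-value
    combination in the original and in the synthetic data.
    """
    orig = np.asarray(orig, dtype=float)
    synth = np.asarray(synth, dtype=float)
    mean_pair = (orig + synth) / 2.0   # x-axis of the plot
    diff = synth - orig                # y-axis of the plot
    bias = diff.mean()                 # mean difference
    sd = diff.std(ddof=1)              # sample standard deviation
    lower, upper = bias - z * sd, bias + z * sd
    outside = (diff < lower) | (diff > upper)  # points beyond the lines
    return mean_pair, diff, bias, (lower, upper), outside

if __name__ == "__main__":
    # Hypothetical shares for a handful of variable-value combinations.
    original = [0.12, 0.30, 0.25, 0.18, 0.10, 0.05]
    synthetic = [0.13, 0.29, 0.26, 0.17, 0.11, 0.04]
    _, diff, bias, (lo, hi), outside = bland_altman(original, synthetic)
    print("bias:", bias, "limits:", (lo, hi))
    print("combinations outside the limits:", np.flatnonzero(outside))
```

The indices flagged in `outside` correspond to the points that fall beyond the two lines in such a plot, which is where the synthetic data disagrees most with the original.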