Population Synthesis using Incomplete Information

Tanay Rastogi, Daniel Jonsson, Anders Karlström

TL;DR

The paper tackles population synthesis from incomplete microsamples by extending the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) to operate on masked data. A mask matrix $Y$ is applied to the generator output before it reaches the discriminator, so the model can learn from datasets with missing attributes; training also uses a gradient penalty and regularizers to balance sampling zeros against structural zeros. Validating on the Swedish national travel survey, the authors show that synthetic populations from WGAN-GP models trained on incomplete data closely resemble those from models trained on complete data and align with ground-truth marginals, although the high-dimensional cases expose difficulties with structural zeros because few of the possible attribute combinations occur in real data. The study contributes a robust mask-based training approach for population synthesis under incomplete data, with implications for agent-based model (ABM) transport simulations and other domains where privacy constraints and data gaps prevent full use of microsamples. As future work, the authors suggest conditioning approaches (e.g., CT-GAN) and combining the model with marginal information to synthesize future populations that remain consistent with known marginals.

Abstract

This paper presents a population synthesis model that utilizes the Wasserstein Generative-Adversarial Network (WGAN) for training on incomplete microsamples. By using a mask matrix to represent missing values, the study proposes a WGAN training algorithm that lets the model learn from a training dataset that has some missing information. The proposed method aims to address the challenge of microsamples that lack information on one or more attributes due to privacy concerns or data collection constraints. The paper compares WGAN models trained on incomplete microsamples with those trained on complete microsamples, generating a synthetic population from each. We conducted a series of evaluations of the proposed method using a Swedish national travel survey. We validate the efficacy of the proposed method by generating synthetic populations from all the models and comparing them to the actual population dataset. The results from the experiments show that the proposed methodology successfully generates synthetic data that closely resembles both the data generated by a model trained on complete data and the actual population. The paper contributes to the field by providing a robust solution for population synthesis with incomplete data, opening avenues for future research, and highlighting the potential of deep generative models in advancing population synthesis capabilities.
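
The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_training_data_and_mask` and `apply_mask` are hypothetical, and the convention that the mask holds 1 for observed entries and 0 for missing ones is an assumption (the source states only that missing values become 0 in both the training data and the mask). Plain Python lists stand in for the tensors a real WGAN would use.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def build_training_data_and_mask(
    sample: List[List[Optional[float]]],
) -> Tuple[Matrix, Matrix]:
    """Split a sample with missing values (None) into training data and a mask.

    Missing entries become 0 in the training data. The mask is assumed to be
    1 where a value was observed and 0 where it was missing.
    """
    data: Matrix = []
    mask: Matrix = []
    for row in sample:
        data.append([0.0 if v is None else float(v) for v in row])
        mask.append([0.0 if v is None else 1.0 for v in row])
    return data, mask


def apply_mask(generated: Matrix, mask: Matrix) -> Matrix:
    """Zero out generated values at positions that are missing in the real data.

    After masking, generated rows and real rows have zeros in the same
    positions, so the discriminator only compares observed attributes.
    """
    return [
        [g * m for g, m in zip(g_row, m_row)]
        for g_row, m_row in zip(generated, mask)
    ]


# Toy sample: two individuals, three attributes, None marks a missing value.
sample = [[1, None, 3], [None, 2, 0]]
data, mask = build_training_data_and_mask(sample)

# Hypothetical generator output for the same batch shape.
generated = [[0.5, 0.9, 0.1], [0.7, 0.2, 0.3]]
masked_generated = apply_mask(generated, mask)
```

Note that the second individual has a genuine 0 in the last attribute. The training data alone cannot distinguish it from a missing value, but the mask can, which is why the two are kept as separate matrices.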

Paper Structure

This paper contains 17 sections, 18 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration showing an example of sample data with the corresponding training data and mask. Missing values are represented as NA in the sample and are replaced by 0 in both the training data and the mask.
  • Figure 2: Heatmap illustrating the disparity between the SCB and generated full population data for the marginal distribution of the population across all 21 counties in Sweden. Each cell displays the % error and corresponding count from SCB 2006 (enclosed in brackets).
  • Figure 3: Conceptual diagram showing the distribution of combinations of categories and data types for our study.
  • Figure 4: Plots with the ratio of general sample, sampling zero, structural zero, precision and recall for 16k and 7M dimensional joint data at different sampling levels.
  • Figure :