Multi-objective evolutionary GAN for tabular data synthesis

Nian Ran, Bahrul Ilmi Nasution, Claire Little, Richard Allmendinger, Mark Elliot

TL;DR

The paper tackles the challenge of generating tabular synthetic data that preserves utility while minimizing disclosure risk, a problem intensified by mixed variable types and adversarial attacks. It introduces SMOE-CTGAN, which combines a CTGAN backbone with smart multi-objective evolution and deep reinforcement learning to optimize both utility and risk ($\text{TCAP}$, $\text{CIO}$, $\text{ROC}$). A new Improvement Score guides model selection, enabling near-zero risk while achieving high utility across four census datasets, and outperforming the CTGAN baseline on the Improvement Score in several settings. The work demonstrates the potential of MO optimisation for tabular data synthesis and provides code to reproduce the results.

Abstract

Synthetic data has a key role to play in data sharing by statistical agencies and other generators of statistical data products. Generative Adversarial Networks (GANs), typically applied to image synthesis, are also a promising method for tabular data synthesis. However, tabular data poses unique challenges compared to images: e.g., tabular data may contain both continuous and discrete variables and may require conditional sampling, and, critically, the data should possess high utility and low disclosure risk (the risk of re-identifying a population unit or learning something new about them), providing an opportunity for multi-objective (MO) optimisation. Inspired by MO GANs for images, this paper proposes a smart MO evolutionary conditional tabular GAN (SMOE-CTGAN). This approach models conditional synthetic data by applying conditional vectors in training, and uses concepts from MO optimisation to balance disclosure risk against utility. Our results indicate that SMOE-CTGAN is able to discover synthetic datasets with different risk and utility levels for multiple national census datasets. We also find a sweet spot in the early stage of training where a competitive utility and extremely low risk are achieved, by using an Improvement Score. The full code can be downloaded from https://github.com/HuskyNian/SMO_EGAN_pytorch.

Paper Structure

This paper contains 14 sections, 13 equations, 4 figures, 2 tables, and 2 algorithms.

Figures (4)

  • Figure 1: The training flow of SMOE-CTGAN. It begins by sampling noise from a normal distribution and creating a conditional vector from categorical data. The generator uses both to create data, which is then evaluated by a discriminator whose input includes the conditional vector. To produce better and more diverse offspring, three different loss functions are available, and a deep reinforcement learning algorithm selects the loss function used to train the generator at each step, given its evaluation values. A multi-objective selection operator is then applied at regular intervals (the select frequency) to choose the best candidates from a mix of parent and offspring solutions for the next generation, using non-dominated sorting and crowding distance during evaluation.
  • Figure 2: Smart Variation: The loss functions used to train generators are selected by a deep reinforcement learning module; the choices of loss function are the actions. The states of generators are encoded by utility and risk values. States, new states, and actions are stored as transitions and then passed to the deep reinforcement learning module for its training.
  • Figure 3: Training curves of the proposed approach, SMOE-CTGAN, on all census datasets.
  • Figure 4: Final population (8 solutions) from a single run, plotted against the two objectives and the normalized Improvement Score. The circled solutions are the non-dominated solutions.
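
The captions for Figures 1 and 4 refer to non-dominated sorting and crowding distance without showing how they operate. The following is a minimal sketch of that style of selection (as popularised by NSGA-II), not the authors' implementation. It assumes each candidate generator is summarised by a pair (risk, utility loss), both to be minimised; this encoding and the function names `dominates`, `non_dominated_sort`, `crowding_distance`, and `select` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed encoding: (risk, utility_loss), where both values are minimised.
Point = Tuple[float, float]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(points: List[Point]) -> List[List[int]]:
    """Split indices into fronts; the first front is the non-dominated set."""
    remaining = set(range(len(points)))
    fronts: List[List[int]] = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(points: List[Point], front: List[int]) -> Dict[int, float]:
    """Larger distance means the solution sits in a sparser region of its front."""
    dist = {i: 0.0 for i in front}
    for m in range(len(points[front[0]])):
        ordered = sorted(front, key=lambda i: points[i][m])
        lo, hi = points[ordered[0]][m], points[ordered[-1]][m]
        # Boundary solutions are always kept, so they get infinite distance.
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for k in range(1, len(ordered) - 1):
            gap = points[ordered[k + 1]][m] - points[ordered[k - 1]][m]
            dist[ordered[k]] += gap / (hi - lo)
    return dist


def select(points: List[Point], k: int) -> List[int]:
    """Pick k survivors: fill front by front, break ties by crowding distance."""
    chosen: List[int] = []
    for front in non_dominated_sort(points):
        if len(chosen) + len(front) <= k:
            chosen.extend(front)
        else:
            dist = crowding_distance(points, front)
            ranked = sorted(front, key=lambda i: dist[i], reverse=True)
            chosen.extend(ranked[: k - len(chosen)])
            break
    return chosen
```

Under these assumptions, calling `select(points, 8)` on the combined parent and offspring evaluations would return the indices of the eight candidates carried into the next generation, and `non_dominated_sort(points)[0]` would give the indices that correspond to the circled solutions in a plot like Figure 4.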