Convex space learning for tabular synthetic data generation

Manjunath Mahendra, Chaithra Umesh, Saptarshi Bej, Kristian Schultz, Olaf Wolkenhauer

TL;DR

NextConvGeN extends Convex Space Learning to generate complete tabular datasets by learning the convex space of data neighborhoods with a cooperative generator–discriminator framework. It uses Feature-type Distributed Clustering to form meaningful neighborhoods and an alpha clipping mechanism to maintain convexity while reducing exact copies. Across ten biomedical datasets, NextConvGeN and TabDDPM achieve strong utility and realistic data distributions, though privacy-utility trade-offs vary with model choice; GAN-based methods often offer stronger privacy at potential utility costs. The work introduces a new, high-utility paradigm for synthetic clinical data and highlights opportunities for task-specific generation and hybrid approaches to balance utility and privacy in practice.

Abstract

Generating synthetic samples from the convex space of the minority class is a popular oversampling approach for imbalanced classification problems. Recently, deep-learning approaches have been successfully applied to modeling the convex space of minority samples. Beyond oversampling, learning the convex space of neighborhoods in training data has not been used to generate entire tabular datasets. In this paper, we introduce a deep learning architecture (NextConvGeN) with a generator and discriminator component that can generate synthetic samples by learning to model the convex space of tabular data. The generator takes data neighborhoods as input and creates synthetic samples within the convex space of that neighborhood. Thereafter, the discriminator tries to classify these synthetic samples against a randomly sampled batch of data from the rest of the data space. We compared our proposed model with five state-of-the-art tabular generative models across ten publicly available datasets from the biomedical domain. Our analysis reveals that synthetic samples generated by NextConvGeN can better preserve classification and clustering performance across real and synthetic data than other synthetic data generation models. Synthetic data generation by deep learning of the convex space produces high scores for popular utility measures. We further compared how diverse synthetic data generation strategies perform in the privacy-utility spectrum and produced critical arguments on the necessity of high utility models. Our research on deep learning of the convex space of tabular data opens up opportunities in clinical research, machine learning model development, decision support systems, and clinical data sharing.
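
The core geometric idea in the abstract, creating synthetic samples inside the convex space of a data neighborhood, can be illustrated with a minimal sketch. This is not the NextConvGeN generator: in the paper the combination weights are learned by a neural network, whereas here they are drawn at random from a Dirichlet distribution, and neighborhoods are found with plain Euclidean nearest neighbors. The function name `convex_neighborhood_samples` and its parameters `k`, `n_samples`, and `seed` are hypothetical choices made for this illustration only.

```python
import numpy as np

def convex_neighborhood_samples(data, anchor_idx, k=5, n_samples=3, seed=0):
    """Draw samples from the convex hull of one k-nearest-neighbor neighborhood.

    Assumption: weights are random Dirichlet draws, standing in for the
    learned weights a trained generator would produce.
    """
    rng = np.random.default_rng(seed)
    # Euclidean distance from the anchor row to every row (anchor has distance 0).
    dists = np.linalg.norm(data - data[anchor_idx], axis=1)
    neighborhood = data[np.argsort(dists)[:k]]            # shape (k, d)
    # Each weight row is non-negative and sums to 1, so every output row
    # is a convex combination of the neighborhood rows.
    weights = rng.dirichlet(np.ones(k), size=n_samples)   # shape (n_samples, k)
    return weights @ neighborhood, weights, neighborhood

# Toy continuous data; real tabular data would also need handling for
# categorical features, which this sketch ignores.
real = np.random.default_rng(42).normal(size=(50, 4))
synthetic, weights, neighborhood = convex_neighborhood_samples(real, anchor_idx=0)
```

Because the weights are non-negative and sum to one, each synthetic row lies between the per-feature minimum and maximum of its neighborhood, which is the property that keeps generated samples close to the local structure of the real data.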

Paper Structure (24 sections, 11 equations, 10 figures, 11 tables)

Figures (10)

  • Figure 1: Architecture of the NextConvGeN model. The generator takes shuffled data neighborhood batches as input and produces convex combinations of the samples in each neighborhood. The discriminator classifies these generated samples against a randomly selected batch from outside the neighborhood, of the same size as the synthetic batch from inside the neighborhood. After NextConvGeN training is complete, the discriminator can be trained further and repurposed as a classifier.
  • Figure 2: Left Plot (Euclidean Distance): Box plot illustrating the average Euclidean distance between real and synthetic datasets. Lower values indicate better privacy preservation. The similarity across models suggests comparable privacy risk. Centre Plot (Hausdorff Distance): Box plot depicting the Hausdorff distance between real and synthetic datasets. Higher values indicate better privacy preservation. CTGAN exhibits values in a higher range, implying lower privacy risk compared to other models. Right Plot (Cosine Similarity): Box plot displaying the cosine similarity between real and synthetic datasets. Lower values indicate better privacy preservation. CTAB-GAN consistently demonstrates lower cosine similarity values, implying lower privacy risk by this measure.
  • Figure 3: The line plots above depict precision values for membership inference attacks against generative models. Each plot represents precision on the y-axis and the proportion of data available to the attacker on the x-axis, with different colored lines representing various thresholds. CTGAN, CTAB-GAN, and TabDDPM consistently exhibit precision values below $0.5$ across different access proportions and thresholds, indicating a low risk of reidentification. Conversely, NextConvGeN's plot shows precision values around $0.6$ across varying access levels and thresholds, suggesting an increased risk of reidentification.
  • Figure 4: The box plots above show the results of the Attribute Inference Attack across different models. For categorical features, F1-scores were calculated, where lower values indicate a lower reidentification risk (left box plot). For continuous features, Root Mean Square (RMS) values were computed, where larger values suggest a lower reidentification risk (right box plot). The plots reveal that CTGAN and CTAB-GAN consistently exhibit lower F1-scores and higher RMS values, indicating a lower reidentification risk compared to TabDDPM and NextConvGeN.
  • Figure 5: PCA visualization of the first two principal components for real data and synthetic data generated using the CTGAN, CTAB-GAN, NextConvGeN, and TabDDPM models on the Pima Indian Diabetes dataset. The plot demonstrates that the synthetic data produced by the NextConvGeN model closely resembles the distribution of the real data, outperforming the other generative models in preserving the data structure.
  • ...and 5 more figures
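
The three distance-based measures named in the Figure 2 caption (average Euclidean distance, Hausdorff distance, and cosine similarity between real and synthetic data) can be sketched as follows. The exact aggregation used in the paper is not stated in this excerpt, so the sketch assumes averages over all real-synthetic row pairs and the symmetric Hausdorff distance; all function names are hypothetical, and the inputs are assumed to be numeric arrays with no all-zero rows.

```python
import numpy as np

def pairwise_euclidean(a, b):
    """Matrix of Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def mean_euclidean(real, synth):
    # Assumption: average over all real-synthetic pairs.
    return pairwise_euclidean(real, synth).mean()

def hausdorff(real, synth):
    # Symmetric Hausdorff distance: the larger of the two directed distances,
    # each being the worst-case nearest-neighbor distance from one set to the other.
    d = pairwise_euclidean(real, synth)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def mean_cosine_similarity(real, synth):
    # Assumption: average over all pairs; rows must have non-zero norm.
    rn = real / np.linalg.norm(real, axis=1, keepdims=True)
    sn = synth / np.linalg.norm(synth, axis=1, keepdims=True)
    return (rn @ sn.T).mean()

# Tiny hand-checkable example.
real = np.array([[1.0, 0.0], [0.0, 1.0]])
synth = np.array([[1.0, 0.0], [0.0, 2.0]])
h = hausdorff(real, synth)                 # 1.0
c = mean_cosine_similarity(real, synth)    # 0.5
e = mean_euclidean(real, synth)            # (0 + sqrt(5) + sqrt(2) + 1) / 4
```

A synthetic set that copies the real data exactly gives a Hausdorff distance of zero, which is why larger Hausdorff values are read in the caption as lower privacy risk.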