GAN-based Tabular Data Generator for Constructing Synopsis in Approximate Query Processing: Challenges and Solutions

Mohammadali Fallahian, Mohsen Dorodchi, Kyle Kreth

TL;DR

The paper addresses the challenge of building effective data synopses for Approximate Query Processing (AQP) using Generative Adversarial Networks (GANs) on tabular data. It identifies key obstacles—data-type heterogeneity, bounded continuous attributes, non-Gaussian distributions, imbalanced categorical values, and semantic constraints—and proposes a three-pronged solution: data transformation, distribution matching, and conditional/informed generators. A survey of existing tabular GAN variants (e.g., medGAN, Table-GAN, CTGAN, CTAB-GAN, TGAN, DATGAN) is coupled with a tailored evaluation framework (SDMetrics-based) to assess coverage, constraints, similarity, and relationships between real and synthetic synopses. The findings suggest that advanced tabular GANs can produce high-fidelity synopses that enable faster, still-reliable approximate queries, with significant implications for real-time, data-driven decision making.

Abstract

In data-driven systems, data exploration is imperative for making real-time decisions. However, big data is stored in massive databases that are difficult to query. Approximate Query Processing (AQP) provides approximate answers to aggregate queries from a summary of the data (a synopsis) that closely replicates the behavior of the actual data; it is useful where an approximate answer is acceptable in exchange for a fraction of the exact query's execution time. This study explores the use of Generative Adversarial Networks (GANs) to generate tabular data that can serve as a synopsis in AQP. We investigate the challenges specific to synopsis construction, including maintaining data distribution characteristics, handling bounded continuous and categorical data, and preserving semantic relationships, and then introduce advances in tabular GAN architectures that overcome these challenges. Furthermore, we propose and validate a suite of statistical metrics tailored to assessing the reliability of GAN-generated synopses. Our findings demonstrate that advanced GAN variants show a promising capacity to generate high-fidelity synopses, potentially transforming the efficiency and effectiveness of AQP in data-driven systems.
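
To make the AQP idea in the abstract concrete, the following is a minimal sketch of answering an aggregate query from a synopsis instead of the full table. It is not the paper's method: a uniform random sample stands in for the GAN-generated synopsis, and the table, its columns (`region`, `amount`), and all function names are hypothetical and chosen only for illustration.

```python
import random


def build_synopsis(rows, fraction, seed=0):
    """Stand-in synopsis: a uniform random sample of the table.

    Assumption: the paper builds the synopsis with a tabular GAN;
    sampling is used here only to keep the sketch self-contained.
    """
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


def exact_sum(rows, predicate, column):
    """Exact answer: scan every row of the full table."""
    return sum(r[column] for r in rows if predicate(r))


def approx_sum(synopsis, total_rows, predicate, column):
    """Approximate answer: scan the synopsis and scale up to the table size."""
    scale = total_rows / len(synopsis)
    return scale * sum(r[column] for r in synopsis if predicate(r))


def relative_error(exact, approx):
    """Relative error of the approximate answer against the exact one."""
    return abs(exact - approx) / abs(exact)


# Hypothetical table: 100,000 rows with one categorical and one bounded
# continuous attribute.
rng = random.Random(42)
table = [
    {"region": rng.choice("ABC"), "amount": rng.uniform(0, 100)}
    for _ in range(100_000)
]

# The synopsis holds 1% of the rows.
synopsis = build_synopsis(table, fraction=0.01)


def in_region_a(row):
    return row["region"] == "A"


# SELECT SUM(amount) FROM table WHERE region = 'A'
exact = exact_sum(table, in_region_a, "amount")
approx = approx_sum(synopsis, len(table), in_region_a, "amount")
err = relative_error(exact, approx)

print(f"exact={exact:.1f} approx={approx:.1f} relative_error={err:.3f}")
```

The approximate query touches only 1% of the rows, which is where the speed-up comes from; the relative error is the kind of quantity a reliability metric for a synopsis would need to keep small.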

Paper Structure (23 sections, 28 equations, 13 figures, 2 tables)

Figures (13)

  • Figure 1: Query processing flow diagram in AQP.
  • Figure 2: GANs process flow diagram.
  • Figure 3: Conditional GAN process flow diagram.
  • Figure 4: medGAN architecture: the discriminator uses an autoencoder (trained on real data) to receive the decoded random noise variable.
  • Figure 5: Pre-processing input data before feeding the discriminator in PNR-GAN
  • ...and 8 more figures