Tabular data generation with tensor contraction layers and transformers

Aníbal Silva, André Restivo, Moisés Santos, Carlos Soares

TL;DR

The paper addresses generative modeling for tabular data, whose mixed-type features make the underlying distribution hard to learn, by processing embedding-based feature representations with tensor contraction layers and transformers inside Variational Autoencoders. It introduces three architectures (TensorContracted, Transformed, and TensorConFormer) and compares them with a baseline VAE on density estimation and ML-efficiency metrics across the OpenML CC18 suite. The key findings are that tensor contraction layers improve density-estimation metrics (notably alpha-precision), that TensorConFormer improves the diversity of the generated data, and that the transformer-only model (Transformed) struggles to generalize the data distribution; ML-efficiency remains competitive for all variants other than Transformed. These results suggest that combining multi-linear embedding processing with attention mechanisms is practically useful for realistic, scalable tabular data generation, with implications for privacy-preserving data synthesis and augmentation on real-world datasets.

Abstract

Generative modeling for tabular data has recently gained significant attention in the Deep Learning domain. Its objective is to estimate the underlying distribution of the data. However, estimating the underlying distribution of tabular data poses unique challenges. Specifically, this data modality is composed of mixed types of features, making it a non-trivial task for a model to learn intra-relationships between them. One approach to handling the mixed feature types is to embed each feature into a continuous matrix via tokenization, while one way to capture intra-relationships between variables is the transformer architecture. In this work, we empirically investigate the potential of using embedding representations for tabular data generation, utilizing tensor contraction layers and transformers to model the underlying distribution of tabular data within Variational Autoencoders. Specifically, we compare four architectural approaches: a baseline VAE model, two variants that focus on tensor contraction layers and transformers respectively, and a hybrid model that integrates both techniques. Our empirical study, conducted across multiple datasets from the OpenML CC18 suite, compares models over density estimation and Machine Learning efficiency metrics. The main takeaway from our results is that leveraging embedding representations with the help of tensor contraction layers improves density estimation metrics while maintaining competitive performance in terms of machine learning efficiency.
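
To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a common tokenization scheme (a learned weight and bias per numerical feature, a lookup table per categorical feature) and a generic mode-wise tensor contraction in which one factor matrix acts on the feature axis and another on the embedding axis. All names (`tokenize`, `V_feat`, `V_emb`), the sizes, and the randomly drawn "weights" are hypothetical and stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4                        # embedding dimension (assumed)
x_num = np.array([0.5, -1.2])  # two numerical feature values of one row
x_cat = [2, 0]               # category indices of two categorical features
cardinalities = [3, 5]       # number of categories per categorical feature

# Hypothetical stand-ins for learned parameters.
num_weight = rng.normal(size=(len(x_num), d))
num_bias = rng.normal(size=(len(x_num), d))
cat_tables = [rng.normal(size=(k, d)) for k in cardinalities]


def tokenize(x_num, x_cat, num_weight, num_bias, cat_tables):
    """Embed one row into an (M, d) matrix with one token per feature."""
    # Numerical feature: the scalar value scales a learned vector, plus a bias.
    num_tokens = x_num[:, None] * num_weight + num_bias
    # Categorical feature: look up the embedding row of the observed category.
    cat_tokens = np.stack([table[c] for table, c in zip(cat_tables, x_cat)])
    return np.concatenate([num_tokens, cat_tokens], axis=0)


tokens = tokenize(x_num, x_cat, num_weight, num_bias, cat_tables)  # (M, d)
M = tokens.shape[0]

# Mode-wise contraction: map the (M, d) embedding to (M_out, d_out).
M_out, d_out = 3, 2
V_feat = rng.normal(size=(M_out, M))  # acts on the feature axis
V_emb = rng.normal(size=(d_out, d))   # acts on the embedding axis
contracted = np.einsum("nm,md,kd->nk", V_feat, tokens, V_emb)

print(tokens.shape, contracted.shape)
```

For a matrix-shaped embedding this contraction is equivalent to `V_feat @ tokens @ V_emb.T`; the `einsum` form is used because it extends directly to a leading batch axis.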

Paper Structure

This paper contains 46 sections, 20 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Illustrative example of a tensor contraction operation with $M=d=W=2$.
  • Figure 2: Left: Illustration of an embedding based VAE architecture. Right: Encoder and Decoder mappings of TensorConFormer. Each block denotes a feature map inside an encoder/decoder, with the respective input/output dimensions. Arrows denote operations performed over each feature representation.
  • Figure 3: Model comparisons for the considered evaluation metrics using the Bayes Sign Test. Bars denote model comparisons, where each color denotes the probability of a given model (on the left, or right) being practically better than the other, or their performance being practically equivalent using a ROPE of 0.03. TC, TF, and TCF are abbreviations for TensorContracted, Transformed, and TensorConFormer.
  • Figure 4: Radar charts for the considered evaluation metrics based on the average ranking of the dataset and feature size (the lower the radius, the better). Top: Average rank as a function of the dataset size (in thousands). Bottom: Average rank as a function of the feature size. The last column denotes the average rank, over all evaluation metrics.
  • Figure 5: Feature distributions of continuous and categorical variables of pre-selected datasets, conditioned over the majority class. The top row presents the distribution of generated data from the considered models when trained with the whole data, while the bottom row shows the same distribution when the given models are only trained with samples from the majority class.
  • ...and 4 more figures
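
The Figure 3 caption refers to a Bayes Sign Test with a ROPE (region of practical equivalence) of 0.03. As a rough illustration of how such a comparison is typically computed, here is a sketch of the standard Bayesian sign test over per-dataset score differences. Only the ROPE value comes from the caption; the function name `bayes_sign_test`, the prior placement, the number of Monte Carlo samples, and the example differences are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def bayes_sign_test(diffs, rope=0.03, prior_strength=1.0,
                    n_samples=50_000, seed=0):
    """Posterior probabilities that the left model is practically better,
    that both are practically equivalent, or that the right model is better.

    diffs[i] = score of left model minus score of right model on dataset i
    (higher score assumed to be better).
    """
    diffs = np.asarray(diffs, dtype=float)
    n_right = np.sum(diffs < -rope)        # right model wins by more than ROPE
    n_rope = np.sum(np.abs(diffs) <= rope)  # difference inside the ROPE
    n_left = np.sum(diffs > rope)          # left model wins by more than ROPE

    # Dirichlet posterior over the three regions; the prior pseudo-observation
    # is placed in the ROPE (an assumed, commonly used choice).
    alpha = np.array([n_left, n_rope, n_right], dtype=float) + 1e-4
    alpha[1] += prior_strength

    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(alpha, size=n_samples)

    # Probability of each region = share of samples in which it is the largest.
    winners = np.argmax(theta, axis=1)
    p_left, p_rope, p_right = np.bincount(winners, minlength=3) / n_samples
    return {"left": p_left, "rope": p_rope, "right": p_right}


# Made-up differences over 20 datasets: left model clearly ahead on 18 of them.
example_diffs = [0.10] * 18 + [0.0] * 2
result = bayes_sign_test(example_diffs)
print(result)
```

The three returned probabilities correspond to the three colored segments of each bar described in the caption: left model practically better, practically equivalent, and right model practically better.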