Tabular Data Generation using Binary Diffusion

Vitaliy Kinakh; Slava Voloshynovskiy

TL;DR

We introduce a lossless binary transformation that converts any tabular data into fixed-size binary representations, together with Binary Diffusion, a generative model designed specifically for binary data. Binary Diffusion outperforms existing state-of-the-art models on the Travel, Adult Income, and Diabetes datasets while being significantly smaller in size.

Abstract

Generating synthetic tabular data is critical in machine learning, especially when real data is limited or sensitive. Traditional generative models often face challenges due to the unique characteristics of tabular data, such as mixed data types and varied distributions, and require complex preprocessing or large pretrained models. In this paper, we introduce a novel, lossless binary transformation method that converts any tabular data into fixed-size binary representations, and a corresponding new generative model called Binary Diffusion, specifically designed for binary data. Binary Diffusion leverages the simplicity of XOR operations for noise addition and removal and employs binary cross-entropy loss for training. Our approach eliminates the need for extensive preprocessing, complex noise parameter tuning, and pretraining on large datasets. We evaluate our model on several popular tabular benchmark datasets, demonstrating that Binary Diffusion outperforms existing state-of-the-art models on Travel, Adult Income, and Diabetes datasets while being significantly smaller in size. Code and models are available at: https://github.com/vkinakh/binary-diffusion-tabular
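
The XOR-based noising and binary cross-entropy objective mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the Bernoulli noise mask, and the `flip_prob` parameter are assumptions made for illustration only.

```python
import numpy as np

def add_noise(x0: np.ndarray, flip_prob: float, rng: np.random.Generator):
    """XOR a random binary mask onto x0 (mask distribution is an assumption)."""
    z = (rng.random(x0.shape) < flip_prob).astype(np.uint8)  # 1 = flip this bit
    return x0 ^ z, z

def remove_noise(xt: np.ndarray, z: np.ndarray) -> np.ndarray:
    """XOR is its own inverse, so applying the same mask recovers x0."""
    return xt ^ z

def bce(probs: np.ndarray, targets: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted bit probabilities and target bits."""
    p = np.clip(probs, eps, 1.0 - eps)
    t = targets.astype(np.float64)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))

rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=(4, 35), dtype=np.uint8)  # 4 hypothetical binary rows
xt, z = add_noise(x0, flip_prob=0.3, rng=rng)
recovered = remove_noise(xt, z)
```

Because XOR is self-inverse, a model that predicts the mask `z` (or the clean bits directly) is enough to undo the corruption, and either target can be scored with `bce`.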

Paper Structure (11 sections, 1 equation, 3 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Data transformation
Binary Diffusion
Results
Conclusions
Sampling algorithm
Evaluation models hyperparameters
Runtime comparison
Implementation details
Effect of sampling steps

Figures (3)

Figure 1: Transformation of tabular data ${\bf t}_0$ into the binary form ${\bf x}_0$. The considered transformation is reversible. Continuous column records are represented with length $d_{\text{cont}}=32$ and categorical ones with $d_{\text{cat}}=\log_2 K$, where $K$ is the number of categorical classes.
Figure 2: Binary Diffusion training (left) and sampling (right) schemes.
Figure 3: Analysis of model performance for different numbers of sampling steps. DT stands for Decision Tree model, RF stands for Random Forest model and LR stands for Linear/Logistic regression model.
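
The reversible transformation described in the Figure 1 caption can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes the 32 bits of a continuous value are its IEEE 754 float32 bit pattern, and it rounds $\log_2 K$ up to a whole number of bits so that $K$ need not be a power of two. All function names are hypothetical.

```python
import numpy as np

def encode_continuous(value: float) -> np.ndarray:
    """32 bits of the big-endian float32 representation (assumed encoding)."""
    raw = np.array([value], dtype=">f4").view(np.uint8)  # 4 bytes
    return np.unpackbits(raw)                            # 32 bits, MSB first

def decode_continuous(bits: np.ndarray) -> float:
    raw = np.packbits(bits.astype(np.uint8))
    return float(raw.view(">f4")[0])

def categorical_width(num_classes: int) -> int:
    """ceil(log2 K) bits, computed exactly with integer arithmetic."""
    return max(1, (num_classes - 1).bit_length())

def encode_categorical(index: int, num_classes: int) -> np.ndarray:
    width = categorical_width(num_classes)
    return np.array([(index >> s) & 1 for s in range(width - 1, -1, -1)],
                    dtype=np.uint8)

def decode_categorical(bits: np.ndarray) -> int:
    value = 0
    for b in bits:
        value = (value << 1) | int(b)
    return value

# One hypothetical row: a continuous value and a category index out of K = 8.
row_bits = np.concatenate([encode_continuous(0.5), encode_categorical(5, 8)])
value_back = decode_continuous(row_bits[:32])
category_back = decode_categorical(row_bits[32:])
```

Every row maps to a bit vector of the same fixed length (here 32 + 3 = 35), and decoding the slices recovers the original entries exactly for values representable in float32.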
