TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models

Andrei Margeloiu, Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik

TL;DR

TabEBM is a novel class-conditional generative method that builds a distinct Energy-Based Model (EBM) for each class, so that each model captures only its own class-specific data distribution. This yields robust energy landscapes, even in ambiguous class distributions.

Abstract

Data collection is often difficult in critical fields such as medicine, physics, and chemistry. As a result, classification methods usually perform poorly with these small datasets, leading to weak predictive performance. Increasing the training set with additional synthetic data, similar to data augmentation in images, is commonly believed to improve downstream classification performance. However, current tabular generative methods that learn either the joint distribution $ p(\mathbf{x}, y) $ or the class-conditional distribution $ p(\mathbf{x} \mid y) $ often overfit on small datasets, resulting in poor-quality synthetic data, usually worsening classification performance compared to using real data alone. To solve these challenges, we introduce TabEBM, a novel class-conditional generative method using Energy-Based Models (EBMs). Unlike existing methods that use a shared model to approximate all class-conditional densities, our key innovation is to create distinct EBM generative models for each class, each modelling its class-specific data distribution individually. This approach creates robust energy landscapes, even in ambiguous class distributions. Our experiments show that TabEBM generates synthetic data with higher quality and better statistical fidelity than existing methods. When used for data augmentation, our synthetic data consistently improves the classification performance across diverse datasets of various sizes, especially small ones. Code is available at https://github.com/andreimargeloiu/TabEBM.
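As a brief illustration of the contrast drawn in the abstract, the standard energy-based formulation can be written per class. This is a sketch of the generic EBM definition, not an equation quoted from the paper; the normalising constant $Z_c$ and the class count $C$ are notation introduced here for illustration.

```latex
% One energy function E_c per class c, fitted only on that class's samples
% (generic EBM definition; Z_c and C are notation assumed for this sketch).
\[
  p(\mathbf{x} \mid y = c) \;=\; \frac{\exp\!\big(-E_c(\mathbf{x})\big)}{Z_c},
  \qquad
  Z_c \;=\; \int \exp\!\big(-E_c(\mathbf{x})\big)\, d\mathbf{x},
  \qquad c = 1, \dots, C .
\]
```

Under this reading, low energy corresponds to high class-conditional density, and a shared model would instead use a single parameterisation for all $C$ densities.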

Paper Structure (41 sections, 4 equations, 15 figures, 30 tables, 1 algorithm)

Figures (15)

  • Figure 1: Evaluation of TabEBM and other state-of-the-art tabular generative methods across six key metrics (larger area indicates better performance). The results demonstrate that TabEBM excels in data augmentation (utility), with a larger area than all other methods.
  • Figure 2: An overview of TabEBM. We learn distinct class-specific Energy-Based Models (EBMs) $E_{\textcolor{blue}{blue}}(\mathbf{x})$ and $E_{\textcolor{red}{red}}(\mathbf{x})$ exclusively on the points of their respective class. Each EBM approximates a class-conditional distribution $p(\mathbf{x} \mid y)$. TabEBM allows synthetic data generation by sampling from the estimated distributions for each class $p(\mathbf{x} \mid y=\textcolor{blue}{blue})$ and $p(\mathbf{x} \mid y=\textcolor{red}{red})$.
  • Figure 3: The class-specific energy function $E_c({\mathbf{x}})$ from the surrogate binary task, where the blue region represents low energy (i.e., high data density). Placing the negative samples in a hypercube distant from the data results in an accurate energy function.
  • Figure 4: Mean normalised balanced accuracy improvement (%) across different sample sizes (Left) and across datasets with varying numbers of classes (Right). Because TabPFGen is not applicable to datasets with more than ten classes, we plot short bars at zero for visual clarity. Positive values indicate that the generator improves downstream classification performance. TabEBM generally outperforms benchmark generators across varying sample sizes and numbers of classes.
  • Figure 5: Mean normalised balanced accuracy improvement (%) on imbalanced datasets. TabEBM consistently outperforms the Baseline and other generators across different levels of data imbalance.
  • ...and 10 more figures
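
The pipeline described in the captions of Figures 2 and 3 (one energy function per class, a surrogate binary task with negatives placed on a distant hypercube, then sampling from each class's model) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the Gaussian-kernel "classifier", the log-sum-exp form of the energy, the finite-difference gradient, the simplified Langevin update, the toy data, and every function name are hypothetical stand-ins chosen so the example runs with the Python standard library alone.

```python
import math
import random


def corner_negatives(dim, distance, count, rng):
    """Surrogate negatives: distinct corners of the hypercube [-distance, distance]^dim."""
    corners = set()
    while len(corners) < min(count, 2 ** dim):
        corners.add(tuple(rng.choice((-distance, distance)) for _ in range(dim)))
    return sorted(corners)


def kernel_logit(x, points, bandwidth):
    """Stand-in classifier logit (assumption): log of the mean Gaussian kernel to `points`."""
    vals = [-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * bandwidth ** 2) for p in points]
    top = max(vals)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in vals)) - math.log(len(points))


def make_energy(positives, negatives, bandwidth=1.0):
    """Energy for one class (assumed form): E_c(x) = -log(exp(f(x)[0]) + exp(f(x)[1]))."""
    def energy(x):
        f0 = kernel_logit(x, negatives, bandwidth)  # logit of the surrogate negative class
        f1 = kernel_logit(x, positives, bandwidth)  # logit of the real class-c data
        top = max(f0, f1)
        return -(top + math.log(math.exp(f0 - top) + math.exp(f1 - top)))
    return energy


def numeric_grad(energy, x, eps=1e-4):
    """Central finite-difference gradient of the energy (stand-in for autodiff)."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((energy(hi) - energy(lo)) / (2 * eps))
    return grad


def langevin_sample(energy, start, rng, steps=100, step_size=0.05, noise_scale=0.05):
    """Simplified Langevin-style chain: gradient step on the energy plus Gaussian noise."""
    x = list(start)
    for _ in range(steps):
        grad = numeric_grad(energy, x)
        x = [xi - step_size * g + noise_scale * rng.gauss(0.0, 1.0) for xi, g in zip(x, grad)]
    return x


rng = random.Random(0)

# Toy, roughly standardised two-class data (made up for this sketch).
data = {
    "blue": [(-1.0, -0.8), (-1.2, -1.1), (-0.9, -1.0)],
    "red": [(1.0, 0.9), (1.1, 1.2), (0.8, 1.0)],
}

# Negatives sit on a hypercube far from the data; distance and count are assumptions.
negatives = corner_negatives(dim=2, distance=5.0, count=4, rng=rng)

# One energy function per class, each fitted only on that class's points.
energies = {label: make_energy(points, negatives) for label, points in data.items()}

# Synthetic points per class: chains start at real points of that class.
synthetic = {
    label: [langevin_sample(energies[label], rng.choice(points), rng) for _ in range(5)]
    for label, points in data.items()
}
```

In this stand-in the energy is also low at the surrogate negatives themselves, so the chains are started at real points of the class and kept short; because the negatives are far from the data, the samples stay in the low-energy region around the class they were generated for.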