TabTreeFormer: Tabular Data Generation Using Hybrid Tree-Transformer

Jiayu Li, Bingyin Zhao, Zilong Zhao, Uzair Javaid, Kevin Yee, Biplab Sikdar

TL;DR

TabTreeFormer tackles the challenge of generating high-quality tabular data by injecting tabular inductive biases into a transformer framework. It combines a tree-based priors module, a dual-quantization tokenizer that captures multimodal continuous features, and ordinal-aware embeddings with a specialized ordinal cross-entropy loss to generate realistic synthetic data. Across nine diverse datasets and eight baselines, TabTreeFormer achieves strong utility and fidelity while maintaining privacy and efficiency, including notable gains in data utility when privacy/efficiency are less constrained. This work demonstrates how domain-specific inductive biases can be effectively integrated into neural generative models to improve tabular data generation for downstream analytics and privacy-preserving data sharing.
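To give a rough sense of what an "ordinal-aware" cross-entropy can look like, the sketch below replaces the usual one-hot target with a soft target that decays with distance from the true bin, so near misses cost less than far misses. This is a generic technique chosen for illustration, not the paper's loss, which this summary does not define. The names `ordinal_soft_targets`, `ordinal_cross_entropy`, and the `temperature` parameter are all assumptions.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def ordinal_soft_targets(true_idx, n_classes, temperature=1.0):
    """Soft targets that decay exponentially with distance from the true bin.

    Hypothetical helper: the decay shape and temperature are assumptions.
    """
    idx = np.arange(n_classes)
    dist = np.abs(idx[None, :] - true_idx[:, None])
    weights = np.exp(-dist / temperature)
    return weights / weights.sum(axis=-1, keepdims=True)

def ordinal_cross_entropy(logits, true_idx, temperature=1.0):
    """Cross-entropy against distance-aware soft targets (illustrative only)."""
    targets = ordinal_soft_targets(true_idx, logits.shape[-1], temperature)
    return float(-(targets * log_softmax(logits)).sum(axis=-1).mean())

# Two predictions for a true bin of 3 out of 10 ordered bins:
# one peaked on the neighbouring bin 4, one peaked on the distant bin 9.
true_idx = np.array([3])
logits_near = np.zeros((1, 10))
logits_near[0, 4] = 5.0
logits_far = np.zeros((1, 10))
logits_far[0, 9] = 5.0

near_loss = ordinal_cross_entropy(logits_near, true_idx)
far_loss = ordinal_cross_entropy(logits_far, true_idx)
print(f"near miss loss: {near_loss:.3f}, far miss loss: {far_loss:.3f}")
```

With a plain one-hot target both predictions would receive the same loss, since neither puts its peak on bin 3. The soft target is what makes the near miss cheaper.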

Abstract

Transformers have shown impressive results in tabular data generation. However, they lack domain-specific inductive biases which are critical for preserving the intrinsic characteristics of tabular data. They also suffer from poor scalability and efficiency due to quadratic computational complexity. In this paper, we propose TabTreeFormer, a hybrid transformer architecture that integrates inductive biases of tree-based models (i.e., non-smoothness and non-rotational invariance) to effectively handle the discrete and weakly correlated features in tabular datasets. To improve numerical fidelity and capture multimodal distributions, we introduce a novel tokenizer that learns token sequences based on the complexity of tabular values. This reduces vocabulary size and sequence length, yielding more compact and efficient representations without sacrificing performance. We evaluate TabTreeFormer on nine diverse datasets, benchmarking against eight generative models. We show that TabTreeFormer consistently outperforms baselines in utility, fidelity, and privacy metrics with competitive efficiency. Notably, in scenarios prioritizing data utility over privacy and efficiency, the best variant of TabTreeFormer delivers a 44% performance gain relative to its baseline variant.
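The idea of turning continuous values into a small token vocabulary while keeping multimodal structure can be illustrated with a minimal sketch. It uses plain quantile binning, which is an assumption made for illustration: the abstract does not specify how the paper's tokenizer quantizes values, and the helper names (`fit_quantile_edges`, `encode`, `fit_bin_centers`) are hypothetical.

```python
import numpy as np

def fit_quantile_edges(values, n_bins):
    """Interior bin edges placed at equally spaced quantiles of the data."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, qs)

def encode(values, edges):
    """Map each value to an integer token id in [0, len(edges)]."""
    return np.digitize(values, edges)

def fit_bin_centers(values, tokens, n_bins):
    """Representative value per token: the mean of the training values in that bin."""
    return np.array([values[tokens == b].mean() for b in range(n_bins)])

# Synthetic bimodal column (made-up data, not from the paper).
rng = np.random.default_rng(0)
values = np.concatenate([
    rng.normal(-3.0, 0.5, size=1000),
    rng.normal(4.0, 1.0, size=1000),
])

n_bins = 16
edges = fit_quantile_edges(values, n_bins)
tokens = encode(values, edges)
centers = fit_bin_centers(values, tokens, n_bins)
decoded = centers[tokens]  # token ids mapped back to numeric values

print("vocabulary size:", n_bins)
print("mean absolute reconstruction error:", np.abs(decoded - values).mean())
```

Because quantile edges follow the data density, both modes receive bins and the low-density region between them is not wasted on empty tokens. A real tokenizer would also need to handle ties, missing values, and categorical columns, which this sketch ignores.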

Paper Structure (78 sections, 6 theorems, 22 equations, 15 figures, 13 tables, 2 algorithms)

Key Result

Theorem 1

Given a real dataset $\mathbf{X}$, train $M$ generative models ($\mathcal{G}_1,\mathcal{G}_2,\dots,\mathcal{G}_M$) on an $M$-partition of it ($\mathbf{X}=[\mathbf{X}^{[1]};\mathbf{X}^{[2]};\dots;\mathbf{X}^{[M]}]$). If the generators are well-trained (distribution $p$ of the corresponding partition

Figures (15)

  • Figure 1: Performance comparison of TabTreeFormer (Ours) with SOTA tabular generative models in utility, fidelity, privacy, and efficiency metrics. TabTreeFormer-S achieves the best balance, being the only model whose plot forms a large, near-regular octagon, and TabTreeFormer-NM achieves the best utility.
  • Figure 2: Overview of TabTreeFormer (data flow: bottom $\rightarrow$ top). It consists of 3 components: i) a tree-based model that introduces tabular-specific inductive biases; ii) a tokenizer that efficiently and compactly represents data while capturing multimodal distributions; iii) a transformer model that learns the priors extracted from the tree and tokenizer to generate high-quality synthetic data.
  • Figure 3: Marginal densities of representative multimodal continuous columns from baseline ART and TTF. All have a Distill-GPT2 backbone.
  • Figure 4: TTF versus the top-4 baselines in pair-wise correlation. Real (absolute) correlation values are shown on the left; the absolute error in correlation values for synthetic data from each model, capped at 0.4 for visibility, is shown on the right (darker is better). The top-left features (separated by the yellow line) are categorical and the bottom-right features are numeric.
  • Figure 5: Feature count vs. Trend score improvement of TTF over ART baselines.
  • ...and 10 more figures

Theorems & Definitions (13)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Theorem 3
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • proof
  • Remark 1
  • ...and 3 more