Towards Synthesizing High-Dimensional Tabular Data with Limited Samples

Zuqing Li, Junhao Gan, Jianzhong Qi

TL;DR

CtrTab tackles the difficulty of learning high-dimensional tabular data distributions from limited samples by introducing a conditioning control module that ingests perturbed ground-truth inputs. The method couples a frozen denoising network with a trainable control pathway, where conditioning inputs perturbed with Laplace noise yield an implicit $L_2$ regularization that stabilizes training and improves generalization. Theoretical analysis formalizes the regularization effect as $\tilde{\mathcal{L}} = \mathcal{L} + \eta^2 \mathcal{L}^R$, and experiments show CtrTab consistently surpasses state-of-the-art diffusion-based tabular models across diverse datasets, including extremely high-dimensional cases (up to 10,001 features). The findings show substantial gains in downstream task accuracy (an average gap of over 90% relative to state-of-the-art models) and indicate that the approach remains robust as dimensionality grows in non-privacy-constrained data synthesis.

Abstract

Diffusion-based tabular data synthesis models have yielded promising results. However, when the data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models. This is because limited training samples in high-dimensional space often hinder generative models from capturing the distribution accurately. To mitigate the insufficient learning signals and to stabilize training under such conditions, we propose CtrTab, a condition-controlled diffusion model that injects perturbed ground-truth samples as auxiliary inputs during training. This design introduces an implicit L2 regularization on the model's sensitivity to the control signal, improving robustness and stability in high-dimensional, low-data scenarios. Experimental results across multiple datasets show that CtrTab outperforms state-of-the-art models, with a performance gap in accuracy over 90% on average.
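
The implicit regularization described above can be illustrated on a toy case. The sketch below is not the paper's model or proof: it assumes a hypothetical linear predictor `w` applied to a condition vector `c`, perturbs `c` with independent Laplace noise of scale `eta`, and compares a Monte Carlo estimate of the noisy squared loss with the closed form "clean loss plus an L2 penalty on the sensitivity to the condition". For a linear predictor the sensitivity is `w` itself, and since a Laplace variable with scale `eta` has variance `2 * eta**2`, the penalty is `2 * eta**2 * ||w||^2`. The constant in front of the penalty depends on how the noise scale is parametrized, so it need not match the paper's $\eta^2$ factor exactly. All function names are illustrative.

```python
import numpy as np


def noisy_condition_loss(w, c, y, eta, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[(y - w.(c + delta))^2], delta_i ~ Laplace(0, eta)."""
    rng = np.random.default_rng(seed)
    # One independent Laplace perturbation per sample and per condition feature.
    delta = rng.laplace(loc=0.0, scale=eta, size=(n_samples, c.shape[0]))
    preds = (c + delta) @ w
    return float(np.mean((y - preds) ** 2))


def closed_form_loss(w, c, y, eta):
    """Clean squared loss plus the L2 penalty induced by the perturbation."""
    clean = float((y - c @ w) ** 2)
    # Var[Laplace(0, eta)] = 2 * eta^2, and the sensitivity of w.c to c is w.
    penalty = 2.0 * eta**2 * float(w @ w)
    return clean + penalty


if __name__ == "__main__":
    w = np.array([0.5, -1.0, 2.0])   # hypothetical predictor weights
    c = np.array([1.0, 2.0, 3.0])    # hypothetical condition vector
    y = 1.0
    print("Monte Carlo :", noisy_condition_loss(w, c, y, eta=0.1))
    print("Closed form :", closed_form_loss(w, c, y, eta=0.1))
```

The two printed values should agree up to sampling error. For a nonlinear denoiser the same decomposition holds only approximately, via a first-order expansion in the perturbation, which is why this is a sketch of the intuition rather than a restatement of the paper's result.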

Paper Structure

This paper contains 41 sections, 2 theorems, 32 equations, 8 figures, 9 tables, 2 algorithms.

Key Result

Theorem 1

Let the training objective of CtrTab be to minimize $\mathcal{L} = \mathbb{E}_{\mathbf{x_0}, t, \mathbf{\epsilon}, \mathbf{C}_{f}}{ \| \mathbf{\epsilon} - \mathbf{\epsilon_\theta}(\mathbf{x_t}, t, \mathbf{C}_{f}) \|^2}$, and let the training objective with a noise-added condition be to minimize $\ti

Figures (8)

  • Figure 1: Challenges in tabular data synthesis over high-dimensional data. As the dimensionality increases, F1 scores of all existing models in machine learning tests decrease, while those of our model CtrTab remain stable.
  • Figure 2: Overview of CtrTab. The left (blue) is a denoising network, which receives the noisy input $\mathbf{x}_t$ and the timestep $t$, and predicts noise $\epsilon_\theta$. The right (yellow) is the control module, which encodes conditioning input $\mathbf{C}_f$ and injects intermediate features via element-wise addition ($\oplus$). All fusion operations are followed by zero convolution to match dimensions. This modular design allows injecting conditioning signals without altering the diffusion backbone.
  • Figure 3: Impact of noise scale and the control module.
  • Figure 4: Case study on real-world extremely high-dimensional dataset. The dashed lines correspond to models trained on original data.
  • Figure 5: Overview of denoising diffusion model.
  • ...and 3 more figures
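
The layout in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer sizes, the single hidden layer, the weight names (`W_in`, `W_out`, `W_ctrl`, `W_zero`), and the use of a zero-initialized linear projection in place of the "zero convolution" are all hypothetical. It shows one property of such a design: when the projection that feeds control features into the backbone starts at zero, the combined network initially reproduces the frozen denoiser's output exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 6, 16  # feature width and hidden width (hypothetical sizes)

# Frozen denoising backbone: its weights are never updated in this sketch.
W_in = rng.normal(size=(H, D + 1)) * 0.1   # +1 column for the timestep input
W_out = rng.normal(size=(D, H)) * 0.1

# Trainable control pathway: an encoder followed by a zero-initialized projection.
W_ctrl = rng.normal(size=(H, D)) * 0.1
W_zero = np.zeros((H, H))


def denoise(x_t, t, c_f=None):
    """Predict the noise for x_t at time t, optionally conditioned on c_f."""
    h = np.tanh(W_in @ np.append(x_t, t))
    if c_f is not None:
        ctrl = np.maximum(W_ctrl @ c_f, 0.0)   # encode the conditioning input
        h = h + W_zero @ ctrl                  # element-wise addition of projected features
    return W_out @ h


x0 = rng.normal(size=D)                        # toy "ground-truth" row
eps = rng.normal(size=D)
x_t = 0.8 * x0 + 0.6 * eps                     # toy forward-noised sample (0.8^2 + 0.6^2 = 1)
c_f = x0 + rng.laplace(scale=0.1, size=D)      # perturbed ground truth used as the condition

if __name__ == "__main__":
    same = np.allclose(denoise(x_t, 0.5), denoise(x_t, 0.5, c_f))
    print("control pathway is a no-op at initialization:", same)
```

Starting the projection at zero means training begins from the backbone's behavior, and the control signal only takes effect as `W_zero` is updated.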

Theorems & Definitions (4)

  • Theorem 1
  • proof : Proof Sketch
  • Theorem 2
  • proof