Towards Synthesizing High-Dimensional Tabular Data with Limited Samples
Zuqing Li, Junhao Gan, Jianzhong Qi
TL;DR
CtrTab tackles the difficulty of learning high-dimensional tabular data distributions from limited samples by introducing a conditioning control module that ingests perturbed ground-truth inputs. The method couples a frozen denoising network with a trainable control pathway, where conditioning inputs perturbed with Laplace noise yield an implicit $L_2$ regularization that stabilizes training and improves generalization. Theoretical analysis formalizes the regularization effect as $\tilde{\mathcal{L}} = \mathcal{L} + \eta^2 \mathcal{L}^R$, and experiments show CtrTab consistently surpasses state-of-the-art diffusion-based tabular models across diverse datasets, including extremely high-dimensional cases (up to 10,001 features). The findings demonstrate substantial practical gains in downstream task accuracy (an average gap of over 90% relative to state-of-the-art models) and establish the approach’s robustness for non-privacy-constrained data synthesis at scalable dimensionality.
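As a rough illustration of where a term of the form $\eta^2 \mathcal{L}^R$ can come from, consider a generic second-order expansion of a loss under input noise. This is a sketch under assumptions not stated in this summary: $\ell(c)$ denotes a per-sample loss viewed as a function of the control input $c$, and the noise $\epsilon$ has independent zero-mean Laplace coordinates with scale $\eta$, so that $\mathbb{E}[\epsilon_i \epsilon_j] = 2\eta^2 \delta_{ij}$. The paper's own definition of $\mathcal{L}^R$ may differ.

$$
\mathbb{E}_{\epsilon}\big[\ell(c + \epsilon)\big]
\approx \ell(c) + \mathbb{E}[\epsilon]^\top \nabla_c \ell(c) + \tfrac{1}{2}\, \mathbb{E}\big[\epsilon^\top \nabla_c^2 \ell(c)\, \epsilon\big]
= \ell(c) + \eta^2 \operatorname{tr}\big(\nabla_c^2 \ell(c)\big)
$$

Under these assumptions the first-order term vanishes because the noise has zero mean, and the curvature term plays the role of $\mathcal{L}^R$. For a squared-error loss, that trace contains a term proportional to the squared Frobenius norm of the network's Jacobian with respect to $c$, which is one way to read the regularization as a penalty on sensitivity to the control signal.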
Abstract
Diffusion-based tabular data synthesis models have yielded promising results. However, when the data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models. This is because limited training samples in high-dimensional space often hinder generative models from capturing the distribution accurately. To mitigate the insufficient learning signals and to stabilize training under such conditions, we propose CtrTab, a condition-controlled diffusion model that injects perturbed ground-truth samples as auxiliary inputs during training. This design introduces an implicit $L_2$ regularization on the model's sensitivity to the control signal, improving robustness and stability in high-dimensional, low-data scenarios. Experimental results across multiple datasets show that CtrTab outperforms state-of-the-art models, with an average accuracy gap of over 90%.
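The perturbation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `make_control_input`, the toy batch shape, and the choice to add independent Laplace noise to every feature are assumptions, and the Laplace noise itself is taken from the TL;DR rather than from the abstract.

```python
import numpy as np


def make_control_input(x, eta, rng):
    """Return x perturbed by zero-mean Laplace noise with scale eta.

    Assumption: noise is added independently to every feature. The exact
    noise schedule and feature encoding are not given in this summary.
    """
    x = np.asarray(x, dtype=np.float64)
    noise = rng.laplace(loc=0.0, scale=eta, size=x.shape)
    return x + noise


# Hypothetical toy batch: 8 samples with 1,000 numeric features each.
rng = np.random.default_rng(0)
x_batch = rng.normal(size=(8, 1000))
eta = 0.1
control = make_control_input(x_batch, eta=eta, rng=rng)

# Each coordinate of Laplace(0, eta) noise has variance 2 * eta**2,
# so the empirical variance of the perturbation should be close to that.
empirical_var = np.var(control - x_batch)
expected_var = 2 * eta**2
```

In a setup like the one described, `control` would be fed to the control module as the auxiliary input while the clean `x_batch` remains the denoising target; the scale `eta` sets how strongly the implied regularization acts.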
