Diffusion Models for Tabular Data: Challenges, Current Progress, and Future Directions

Zhong Li, Qi Huang, Lincen Yang, Jiayang Shi, Zhao Yang, Niki van Stein, Thomas Bäck, Matthijs van Leeuwen

TL;DR

This survey consolidates the rapidly evolving field of diffusion models for tabular data, addressing a gap in the literature by organizing methods across data augmentation, imputation, trustworthy synthesis, and anomaly detection. It details foundational diffusion frameworks (Gaussian, multinomial, SGMs, SDEs) and extends them to tabular-specific challenges such as mixed-type features, dependencies, and domain constraints. The authors synthesize a broad array of methods, ranging from generic single-table diffusion (e.g., TabDDPM, SOS, STaSy) and discrete-diffusion variants to healthcare and finance applications, and review evaluation metrics, highlighting strengths, limitations, and practical trade-offs. They also discuss trustworthy data generation via privacy-preserving and fairness-aware approaches, and they outline future research directions including scalable training, standardized benchmarks, and hybrid multi-modal integrations. Overall, this work aims to accelerate progress by providing a rigorous, mathematics-informed blueprint for advancing diffusion-based tabular data modeling in real-world settings.

Abstract

In recent years, generative models have achieved remarkable performance across diverse applications, including image generation, text synthesis, audio creation, video generation, and data augmentation. Diffusion models have emerged as superior alternatives to Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) by addressing their limitations, such as training instability, mode collapse, and poor representation of multimodal distributions. This success has spurred widespread research interest. In the domain of tabular data, diffusion models have begun to showcase similar advantages over GANs and VAEs, achieving significant performance breakthroughs and demonstrating their potential for addressing unique challenges in tabular data modeling. However, while domains like images and time series have numerous surveys summarizing advancements in diffusion models, there remains a notable gap in the literature for tabular data. Despite the increasing interest in diffusion models for tabular data, there has been little effort to systematically review and summarize these developments. This lack of a dedicated survey limits a clear understanding of the challenges, progress, and future directions in this critical area. This survey addresses this gap by providing a comprehensive review of diffusion models for tabular data. Covering works from June 2015, when diffusion models emerged, to December 2024, we analyze nearly all relevant studies, with updates maintained in a GitHub repository (https://github.com/Diffusion-Model-Leiden/awesome-diffusion-models-for-tabular-data). Assuming readers possess foundational knowledge of statistics and diffusion models, we employ mathematical formulations to deliver a rigorous and detailed review, aiming to promote developments in this emerging and exciting area.

Paper Structure

This paper contains 40 sections, 35 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Timeline of Generative Models for Tabular Data: Below the timeline, key advancements in traditional machine learning models and deep generative models are shown, while above the timeline, their extensions in tabular data, such as data synthesis and imputation, are highlighted.
  • Figure 2: Taxonomy of Diffusion Models for Tabular Data.