Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation

Dongjie Wang, Yanyong Huang, Wangyang Ying, Haoyue Bai, Nanxu Gong, Xinyuan Wang, Sixun Dong, Tao Zhe, Kunpeng Liu, Meng Xiao, Pengfei Wang, Pengyang Wang, Hui Xiong, Yanjie Fu

TL;DR

The paper surveys tabular data-centric AI with a focus on feature selection and feature generation, contrasting traditional methods (filters, wrappers, embedded, hybrids, and multi-view variants) against advanced approaches (reinforcement learning and generative AI). It provides a structured taxonomy, analyzes strengths and limitations, and offers guidelines for practice, including AutoML, explainability, and privacy considerations. Key contributions include a comprehensive synthesis of algorithms, a comparative analysis across method classes, and actionable directions for future work in dynamic, high-dimensional, and multi-modal tabular settings. The findings underscore the practical impact of hybrid, data-centric pipelines that combine domain knowledge with scalable AI techniques to improve interpretability and predictive performance in real-world applications.

Abstract

Tabular data is one of the most widely used formats across industries, driving critical applications in areas such as finance, healthcare, and marketing. In the era of data-centric AI, improving data quality and representation has become essential for enhancing model performance, particularly in applications centered around tabular data. This survey examines the key aspects of tabular data-centric AI, emphasizing feature selection and feature generation as essential techniques for data space refinement. We provide a systematic review of feature selection methods, which identify and retain the most relevant data attributes, and feature generation approaches, which create new features to simplify the capture of complex data patterns. This survey offers a comprehensive overview of current methodologies through an analysis of recent advancements, practical applications, and the strengths and limitations of these techniques. Finally, we outline open challenges and suggest future perspectives to inspire continued innovation in this field.

Paper Structure (38 sections, 9 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Real-world applications often generate vast amounts of tabular data, making Data-Centric AI essential for optimizing performance. This survey explores the critical aspects of feature selection and generation, key to advancing tabular data-centric AI.
  • Figure 2: An overview of the taxonomy for existing tabular data-centric AI techniques.
  • Figure 3: Feature selection is framed as an RL problem, wherein the agent's actions correspond to the selection of individual features. The optimization objective is to enhance performance on downstream tasks.
  • Figure 4: Feature generation is formulated as an RL problem, where the agent's actions involve selecting individual features and applying mathematical operations. The optimization goal is the same as for feature selection.
  • Figure 5: RL extensively explores the feature space, generating abundant feature learning (feature selection/generation) knowledge. Generative AI captures this knowledge in a continuous embedding space, enabling gradient-based search to identify optimized feature spaces.
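
The RL framing of feature selection described in the Figure 3 caption can be illustrated with a minimal sketch. It is not any specific method from the survey: it assumes a stateless, bandit-style simplification in which each feature has its own epsilon-greedy agent with two actions (keep or drop) and every agent receives the same reward, namely the downstream accuracy of the resulting feature subset. The synthetic data, the nearest-centroid scorer, and all function names (`make_data`, `downstream_accuracy`, `rl_feature_selection`) are hypothetical and chosen only to keep the example self-contained.

```python
import random


def make_data(n=200, seed=0):
    """Synthetic table (assumption): columns 0-1 carry signal, columns 2-5 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate labels so both classes appear in each split
        signal = [2.0 * label + rng.gauss(0, 0.5), -1.5 * label + rng.gauss(0, 0.5)]
        noise = [rng.gauss(0, 3.0) for _ in range(4)]
        X.append(signal + noise)
        y.append(label)
    return X, y


def downstream_accuracy(X, y, mask):
    """Reward: nearest-centroid accuracy using only the kept columns.

    Centroids are fit on the first half of the rows and scored on the second half.
    A real pipeline would score on a separate validation split.
    """
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature set cannot predict anything
    half = len(X) // 2
    centroids = {}
    for c in (0, 1):
        rows = [X[i] for i in range(half) if y[i] == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for i in range(half, len(X)):
        dists = {
            c: sum((X[i][j] - m) ** 2 for j, m in zip(cols, centroids[c]))
            for c in (0, 1)
        }
        correct += int(min(dists, key=dists.get) == y[i])
    return correct / (len(X) - half)


def rl_feature_selection(X, y, steps=300, epsilon=0.2, seed=0):
    """One epsilon-greedy agent per feature; action 1 keeps it, action 0 drops it."""
    rng = random.Random(seed)
    n_features = len(X[0])
    # q[j][a] is the running mean reward observed when agent j took action a.
    q = [[0.0, 0.0] for _ in range(n_features)]
    counts = [[0, 0] for _ in range(n_features)]
    best_mask = [1] * n_features  # start from the full feature set
    best_score = downstream_accuracy(X, y, best_mask)
    for _ in range(steps):
        mask = []
        for j in range(n_features):
            if rng.random() < epsilon:
                action = rng.randint(0, 1)        # explore
            else:
                action = int(q[j][1] >= q[j][0])  # exploit; ties keep the feature
            mask.append(action)
        reward = downstream_accuracy(X, y, mask)  # shared reward for all agents
        for j, action in enumerate(mask):
            counts[j][action] += 1
            q[j][action] += (reward - q[j][action]) / counts[j][action]
        if reward > best_score:
            best_mask, best_score = mask, reward
    return best_mask, best_score


if __name__ == "__main__":
    X, y = make_data()
    mask, score = rl_feature_selection(X, y)
    print("selected mask:", mask, "accuracy:", score)
```

Because the search starts from the full feature set and only records strict improvements, the returned subset never scores below the all-features baseline on the scoring split. The feature-generation formulation in Figure 4 would extend the action space from keep/drop to choosing features and a mathematical operation to apply to them, under the same reward.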