Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation

Dongjie Wang; Yanyong Huang; Wangyang Ying; Haoyue Bai; Nanxu Gong; Xinyuan Wang; Sixun Dong; Tao Zhe; Kunpeng Liu; Meng Xiao; Pengfei Wang; Pengyang Wang; Hui Xiong; Yanjie Fu

TL;DR

The paper surveys tabular data-centric AI with a focus on feature selection and feature generation, contrasting traditional methods (filters, wrappers, embedded, hybrids, and multi-view variants) against advanced approaches (reinforcement learning and generative AI). It provides a structured taxonomy, analyzes strengths and limitations, and offers guidelines for practice, including AutoML, explainability, and privacy considerations. Key contributions include a comprehensive synthesis of algorithms, a comparative analysis across method classes, and actionable directions for future work in dynamic, high-dimensional, and multi-modal tabular settings. The findings underscore the practical impact of hybrid, data-centric pipelines that combine domain knowledge with scalable AI techniques to improve interpretability and predictive performance in real-world applications.

Abstract

Tabular data is one of the most widely used formats across industries, driving critical applications in areas such as finance, healthcare, and marketing. In the era of data-centric AI, improving data quality and representation has become essential for enhancing model performance, particularly in applications centered around tabular data. This survey examines the key aspects of tabular data-centric AI, emphasizing feature selection and feature generation as essential techniques for data space refinement. We provide a systematic review of feature selection methods, which identify and retain the most relevant data attributes, and feature generation approaches, which create new features to simplify the capture of complex data patterns. This survey offers a comprehensive overview of current methodologies through an analysis of recent advancements, practical applications, and the strengths and limitations of these techniques. Finally, we outline open challenges and suggest future perspectives to inspire continued innovation in this field.
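As a small illustration of the two operations the abstract distinguishes, the sketch below uses only the Python standard library. It is not taken from the survey: the helper names (`pearson`, `select_top_k`, `generate_product`), the correlation-based filter criterion, and the toy table are all assumptions chosen for illustration. Selection keeps the columns most correlated with the target, while generation adds a product column that makes a multiplicative pattern directly visible.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)


def select_top_k(table, target, k):
    """Filter-style selection: keep the k columns most correlated with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def generate_product(table, a, b):
    """Feature generation: return a copy of the table with column a*b added."""
    new_table = dict(table)
    new_table[f"{a}*{b}"] = [x * y for x, y in zip(table[a], table[b])]
    return new_table


# Hypothetical toy table; the target is the area width * height.
table = {
    "width": [1, 2, 3, 4, 1, 2, 3, 4],
    "height": [4, 3, 2, 1, 1, 2, 3, 4],
    "noise": [5, 1, 4, 2, 3, 5, 1, 2],
}
target = [w * h for w, h in zip(table["width"], table["height"])]

kept = select_top_k(table, target, 2)             # selection on the raw columns
augmented = generate_product(table, "width", "height")
best = select_top_k(augmented, target, 1)         # the generated column ranks first
```

On this toy table neither raw column tracks the target closely, whereas the generated product column matches it exactly, which is the sense in which generated features can simplify the capture of complex patterns.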

Paper Structure

This paper contains 38 sections, 9 equations, 5 figures, and 1 table.

Introduction
Fundamentals of Tabular Data
Traditional Approaches to Feature Selection
Single-view Feature Selection
Filter-based Feature Selection
Wrapper-based Feature Selection
Embedded-based Feature Selection
Hybrid Methods for Feature Selection
Multi-view Feature Selection
Supervised Multi-view Feature Selection
Semi-Supervised Multi-view Feature Selection
Unsupervised Multi-view Feature Selection
Challenges and Limitations in Traditional Methods
Traditional Approaches to Feature Generation
Human-Driven Feature Generation
...and 23 more sections

Figures (5)

Figure 1: Real-world applications often generate vast amounts of tabular data, making Data-Centric AI essential for optimizing performance. This survey explores the critical aspects of feature selection and generation, key to advancing tabular data-centric AI.
Figure 2: An overview of the taxonomy for existing tabular data-centric AI techniques.
Figure 3: Feature selection is framed as an RL problem in which the agent's actions correspond to the selection of individual features. The optimization objective is to enhance performance on downstream tasks.
Figure 4: Feature generation is formulated as an RL problem in which the agent's actions involve selecting individual features and applying mathematical operations. The optimization goal is the same as for feature selection.
Figure 5: RL extensively explores the feature space, generating abundant feature learning (feature selection/generation) knowledge. Generative AI captures this knowledge in a continuous embedding space, enabling gradient-based search to identify optimized feature spaces.
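
The RL framing in the Figure 3 caption can be sketched with tabular Q-learning over feature subsets. This is a minimal illustration under stated assumptions, not the specific agents the survey reviews: the function name `q_learning_feature_selection`, the STOP action, the hyperparameters, and the toy scoring function `toy_score` (a stand-in for downstream task performance) are all hypothetical.

```python
import random
from collections import defaultdict


def q_learning_feature_selection(n_features, score_fn, episodes=300,
                                 eps=0.3, lr=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning over feature subsets.

    State:  frozenset of selected feature indices.
    Action: index of an unselected feature to add, or STOP (= n_features).
    Reward: change in score_fn caused by the action (0 for STOP).
    Returns the best subset visited during training and its score.
    """
    rng = random.Random(seed)
    stop = n_features
    q = defaultdict(float)  # (state, action) -> estimated return

    def actions(state):
        return [a for a in range(n_features) if a not in state] + [stop]

    best_set = frozenset()
    best_score = score_fn(best_set)
    for _ in range(episodes):
        state = frozenset()
        score = score_fn(state)
        while True:
            acts = actions(state)
            if rng.random() < eps:                      # explore
                a = rng.choice(acts)
            else:                                       # exploit current estimates
                a = max(acts, key=lambda x: q[(state, x)])
            if a == stop:                               # terminal action, zero reward
                q[(state, a)] += lr * (0.0 - q[(state, a)])
                break
            nxt = state | {a}
            nxt_score = score_fn(nxt)
            reward = nxt_score - score                  # downstream improvement
            target = reward + gamma * max(q[(nxt, x)] for x in actions(nxt))
            q[(state, a)] += lr * (target - q[(state, a)])
            state, score = nxt, nxt_score
            if score > best_score:
                best_set, best_score = state, score
    return best_set, best_score


# Hypothetical downstream score: features 0 and 2 help, the others hurt.
USEFUL = {0, 2}


def toy_score(subset):
    return sum(1.0 if f in USEFUL else -0.5 for f in subset)


subset, score = q_learning_feature_selection(4, toy_score)
```

In practice `score_fn` would train and validate a downstream model on the chosen columns, which makes each reward expensive to compute. For feature generation (Figure 4), the action space would additionally include a choice of mathematical operation to apply to the selected features.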
