A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective

Wangyang Ying, Cong Wei, Nanxu Gong, Xinyuan Wang, Haoyue Bai, Arun Vignesh Malarkkan, Sixun Dong, Dongjie Wang, Denghui Zhang, Yanjie Fu

TL;DR

This survey examines how to improve AI performance on tabular data through a data-centric lens, focusing on feature selection and feature generation. It systematically reviews reinforcement learning (RL) and generative AI methods as automated engines for tabular feature engineering, detailing architectures, techniques, and domain applications. Key contributions include a taxonomy of RL-based methods (multi-agent, single-agent, hybrids) and generative AI approaches (GAINS, VTFS, FNFS, GERBIL, LLM-based strategies), as well as pragmatic guidance for practitioners. The findings underscore trade-offs among performance, interpretability, adaptability, and computational cost, and point to future directions such as privacy-preserving strategies, LLM/multimodal integration, and scalable offline/online feature engineering for tabular data.

Abstract

Tabular data is one of the most widely used data formats across various domains such as bioinformatics, healthcare, and marketing. As artificial intelligence moves towards a data-centric perspective, improving data quality is essential for enhancing model performance in tabular data-driven applications. This survey focuses on data-driven tabular data optimization, specifically exploring reinforcement learning (RL) and generative approaches for feature selection and feature generation as fundamental techniques for refining data spaces. Feature selection aims to identify and retain the most informative attributes, while feature generation constructs new features to better capture complex data patterns. We systematically review existing generative methods for tabular data engineering, analyzing their latest advancements, real-world applications, and respective strengths and limitations. This survey emphasizes how RL-based and generative techniques contribute to the automation and intelligence of feature engineering. Finally, we summarize the existing challenges and discuss future research directions, aiming to provide insights that drive continued innovation in this field.
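To make the idea of feature selection as an RL problem concrete, the following is a minimal sketch and not a method from the survey. It assumes a synthetic table, a least-squares regressor as the downstream task, and a simplified stateless (bandit-style) multi-agent setup in which each column has one agent that chooses to select or deselect it. The names `downstream_score`, `select_features`, and the `penalty` parameter are hypothetical and introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic table (assumption): 6 columns, only columns 0 and 1 drive the target.
n_rows, n_features = 300, 6
X = rng.normal(size=(n_rows, n_features))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n_rows)
X_train, X_val = X[:200], X[200:]
y_train, y_val = y[:200], y[200:]


def downstream_score(mask):
    """Validation R^2 of a least-squares fit on the selected columns."""
    if not mask.any():
        return 0.0  # nothing selected: treat as a mean-only baseline
    A_train = np.column_stack([X_train[:, mask], np.ones(len(X_train))])
    A_val = np.column_stack([X_val[:, mask], np.ones(len(X_val))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = y_val - A_val @ coef
    return 1.0 - np.mean(residual ** 2) / np.var(y_val)


def select_features(episodes=400, penalty=0.05):
    """One agent per column; action 1 selects the column, action 0 drops it."""
    q = np.zeros((n_features, 2))       # q[i, a]: value estimate for agent i, action a
    counts = np.zeros((n_features, 2))  # how often each (agent, action) was tried
    idx = np.arange(n_features)
    for ep in range(episodes):
        # Epsilon-greedy exploration, decayed linearly to a floor of 0.05.
        epsilon = max(0.05, 1.0 - ep / (0.75 * episodes))
        greedy = q.argmax(axis=1)
        random_actions = rng.integers(0, 2, size=n_features)
        explore = rng.random(n_features) < epsilon
        actions = np.where(explore, random_actions, greedy)
        mask = actions.astype(bool)
        # Shared reward: downstream accuracy minus a cost per selected column.
        reward = downstream_score(mask) - penalty * mask.sum()
        # Incremental sample-mean update of each agent's chosen action.
        counts[idx, actions] += 1
        q[idx, actions] += (reward - q[idx, actions]) / counts[idx, actions]
    return q.argmax(axis=1).astype(bool), q


selected_mask, q_values = select_features()
print("selected columns:", np.flatnonzero(selected_mask))
```

The shared reward ties every agent's decision to downstream performance, which is the core of the data-centric framing. A fuller RL treatment would also define a state (for example, statistics of the currently selected subset) and use a learned policy rather than per-action averages.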

Paper Structure

This paper contains 13 sections and 3 figures.

Figures (3)

  • Figure 1: A taxonomy overview of RL-based and generative techniques in tabular data-centric AI.
  • Figure 2: Feature selection and feature generation are formulated as RL problems.
  • Figure 3: Generative AI captures feature knowledge in a continuous embedding space, enabling gradient-based search to identify optimal feature sets.