
Tabular Data Augmentation for Machine Learning: Progress and Prospects of Embracing Generative AI

Lingxi Cui, Huan Li, Ke Chen, Lidan Shou, Gang Chen

TL;DR

This survey addresses the challenge of limited high-quality tabular data for ML by surveying tabular data augmentation (TDA) techniques, with a focus on generative AI. It presents a three-stage pipeline (pre-augmentation, augmentation, post-augmentation) and a level-based taxonomy that captures row, column, cell, and table granularity, distinguishing retrieval-based and generation-based approaches. It tallies around 70 core works, articulates eight pre-augmentation tasks, and details methods across four augmentation tasks, including entity/schema/cell/table operations and generation-based variants such as record generation, feature construction, cell imputation, and table synthesis. The paper highlights trends like enhanced table representations, the convergence of retrieval and generation, and automated end-to-end TDA, and discusses opportunities in multimodal data, efficiency, domain-specific applications, interpretability, and privacy, while providing a public, up-to-date repository for the community.

Abstract

Machine learning (ML) on tabular data is ubiquitous, yet obtaining abundant high-quality tabular data for model training remains a significant obstacle. Numerous works have focused on tabular data augmentation (TDA) to enhance the original table with additional data, thereby improving downstream ML tasks. Recently, there has been a growing interest in leveraging the capabilities of generative AI for TDA. Therefore, we believe it is time to provide a comprehensive review of the progress and future prospects of TDA, with a particular emphasis on the trending generative AI. Specifically, we present an architectural view of the TDA pipeline, comprising three main procedures: pre-augmentation, augmentation, and post-augmentation. Pre-augmentation encompasses preparation tasks that facilitate subsequent TDA, including error handling, table annotation, table simplification, table representation, table indexing, table navigation, schema matching, and entity matching. Augmentation systematically analyzes current TDA methods, categorized into retrieval-based methods, which retrieve external data, and generation-based methods, which generate synthetic data. We further subdivide these methods based on the granularity of the augmentation process at the row, column, cell, and table levels. Post-augmentation focuses on the datasets, evaluation and optimization aspects of TDA. We also summarize current trends and future directions for TDA, highlighting promising opportunities in the era of generative AI. In addition, the accompanying papers and related resources are continuously updated and maintained in the GitHub repository at https://github.com/SuDIS-ZJU/awesome-tabular-data-augmentation to reflect ongoing advancements in the field.

Paper Structure (39 sections, 5 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Example of TDA for ML: ① A data scientist aims to predict house prices based on factors like location and floor using an original training set ($T^O$) with limited features and records. The initial ML model yields sub-optimal results due to insufficient data and numerous missing or incorrect values. To improve performance, the scientist uses TDA to augment the original dataset with additional attributes (columns), records (rows), and corrected values (cells). ② Before augmentation, preparation steps, such as table annotation (e.g., recovering the missing column type of the 4th column in $T^O$), make the TDA process more effective. ③ Augmentation can be achieved through retrieval-based methods (e.g., integrating the 2nd row in $T^C_1$ from an external data source) or generation-based methods that synthesize new data. ④ The augmented table ($T^A$) combines the original and new data. ⑤ After augmentation, evaluation steps assess the effectiveness of the TDA process. ⑥ Finally, the resulting augmented table enables the scientist to train a more accurate price prediction model.
  • Figure 2: Instantiation of TDA tasks at different levels: (a) row-level TDA adds new rows to the original table, (b) column-level TDA introduces new columns, (c) cell-level TDA addresses missing values, and (d) table-level TDA enhances the table by adding both rows and columns. From another viewpoint, retrieval-based TDA methods (grey arrows) augment the original table with data sourced from the table pool, while generation-based TDA methods (blue arrows) generate new data directly based on the original table. Both retrieval- and generation-based TDA methods can be implemented at various levels, including row, column, cell, and table. The illustration shows only some of these possibilities for clarity. We use the housing dataset DVN/DHZ9VY_2018 for the toy example.
  • Figure 3: Overview of the TDA pipeline and the task-based taxonomy of TDA approaches. The input and output of the TDA pipeline are the original table $T^O$ and the augmented table $T^A$, respectively. The TDA pipeline comprises three main procedures: pre-augmentation, augmentation, and post-augmentation.
  • Figure 4: Illustration of table annotation, including (a) ontology-based table annotation, (b) supervised-learning-based table annotation, and (c) PLM-based table annotation.
  • Figure 5: Illustration of table indices, including (a) inverted index, (b) LSH index, and (c) HNSW index.
  • ...and 6 more figures
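
The four augmentation levels described in the Figure 2 caption can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the survey's implementation: the toy tables, the column names, the `area_by_id` lookup, and the function names are all hypothetical, and mean imputation is only a trivial stand-in for the generative models the survey covers at the cell level.

```python
from statistics import mean

# Hypothetical toy data; names and values are illustrative only.
original = [
    {"id": 1, "location": "A", "floor": 3, "price": 300},
    {"id": 2, "location": "B", "floor": None, "price": 200},
]
pool = [{"id": 3, "location": "A", "floor": 5, "price": 340}]
area_by_id = {1: 80, 2: 60, 3: 90}


def augment_rows(table, candidates):
    """Row level (retrieval-style): append candidate rows whose schema matches."""
    schema = set(table[0])
    return table + [dict(r) for r in candidates if set(r) == schema]


def augment_columns(table, lookup, name, key="id"):
    """Column level (retrieval-style): join a new attribute on a key column."""
    return [{**r, name: lookup.get(r[key])} for r in table]


def impute_cells(table, column):
    """Cell level: fill missing values with the column mean (a simple stand-in)."""
    fill = mean(r[column] for r in table if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in table]


def augment_table(table, candidates, lookup, name):
    """Table level: add both rows and columns."""
    return augment_columns(augment_rows(table, candidates), lookup, name)


rows = augment_rows(original, pool)
cols = augment_columns(original, area_by_id, "area")
imputed = impute_cells(rows, "floor")
full = augment_table(original, pool, area_by_id, "area")
```

Each function returns a new list of rows and leaves its input untouched, so the original table $T^O$ stays available for comparison against the augmented table $T^A$ during post-augmentation evaluation.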