Representation Learning for Tabular Data: A Comprehensive Survey

Jun-Peng Jiang, Si-Yang Liu, Hao-Run Cai, Qile Zhou, Han-Jia Ye

TL;DR

This survey addresses learning representations for tabular data, a ubiquitous yet heterogeneous data format. It introduces a three-tier taxonomy—specialized, transferable, and general/tabular foundation models—and organizes methods around feature, sample, and objective aspects to unify diverse approaches. The work surveys methods, benchmarks, evaluation protocols, ensemble strategies, and extensions, while discussing the trade-offs between traditional tree-based methods and deep tabular models. By outlining open challenges and future directions (e.g., open environments and multimodal tabular learning), it aims to guide researchers and practitioners toward more robust, versatile tabular learners.

Abstract

Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications. Models for learning from tabular data have continuously evolved, with Deep Neural Networks (DNNs) recently demonstrating promising results through their capability of representation learning. In this survey, we systematically introduce the field of tabular representation learning, covering the background, challenges, and benchmarks, along with the pros and cons of using DNNs. We organize existing methods into three main categories according to their generalization capabilities: specialized, transferable, and general models. Specialized models focus on tasks where training and evaluation occur within the same data distribution. We introduce a hierarchical taxonomy for specialized models based on the key aspects of tabular data -- features, samples, and objectives -- and delve into detailed strategies for obtaining high-quality feature- and sample-level representations. Transferable models are pre-trained on one or more datasets and subsequently fine-tuned on downstream tasks, leveraging knowledge acquired from homogeneous or heterogeneous sources, or even cross-modalities such as vision and language. General models, also known as tabular foundation models, extend this concept further, allowing direct application to downstream tasks without fine-tuning. We group these general models based on the strategies used to adapt across heterogeneous datasets. Additionally, we explore ensemble methods, which integrate the strengths of multiple tabular models. Finally, we discuss representative extensions of tabular learning, including open-environment tabular machine learning, multimodal learning with tabular data, and tabular understanding. More information can be found in the following repository: https://github.com/LAMDA-Tabular/Tabular-Survey.

Paper Structure

This paper contains 35 sections, 7 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: A brief introduction to tabular data and associated learning tasks. Each row represents an instance, and each column corresponds to a specific attribute or feature, which can be numerical or categorical. The most common tabular machine learning tasks are classification and regression, as shown on the right side of the figure.
  • Figure 2: We organize existing tabular classification/regression methods into three categories according to their generalization capabilities: specialized (left), transferable (middle), and general (right) models. Specialized models focus on tasks where training and evaluation occur within the same data distribution. Transferable models are pre-trained on one or more datasets and subsequently fine-tuned on downstream tasks. General models, also known as tabular foundation models, extend this concept further, allowing direct application to downstream tasks without additional fine-tuning.
  • Figure 3: A roadmap of deep representation learning methods for tabular data. We organize representative methods chronologically to show where research concentrated at different stages. The colors of the methods denote their sub-categories.
  • Figure 4: Illustration of feature-aspect methods, including feature encoding, feature selection, feature projection and feature interaction.
  • Figure 5: Illustration of homogeneous transferable tabular methods. The pre-trained model can be obtained through supervised or self-supervised learning; the latter includes masked language modeling, contrastive pre-training, and hybrid methods.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Remark 6