A Survey on Self-Supervised Learning for Non-Sequential Tabular Data

Wei-Yao Wang; Wei-Wei Du; Derek Xu; Wei Wang; Wen-Chih Peng

TL;DR

This survey systematically reviews and summarizes the recent progress and challenges of SSL for non-sequential tabular data (SSL4NS-TD), presents a formal definition of NS-TD, and clarifies its relationship to related studies.

Abstract

Self-supervised learning (SSL) has been incorporated into many state-of-the-art models in various domains, where it defines pretext tasks on unlabeled datasets to learn contextualized and robust representations. Recently, SSL has become a new trend in exploring representation learning for tabular data, which is more challenging because tabular data lacks explicit relations for learning descriptive representations. This survey aims to systematically review and summarize the recent progress and challenges of SSL for non-sequential tabular data (SSL4NS-TD). We first present a formal definition of NS-TD and clarify its relationship to related studies. Then, we categorize these approaches into three groups: predictive learning, contrastive learning, and hybrid learning, and discuss the motivations and strengths of representative methods in each direction. Moreover, we present application issues of SSL4NS-TD, including automatic data engineering, cross-table transferability, and domain knowledge integration. In addition, we elaborate on existing benchmarks and datasets for NS-TD applications to analyze the performance of existing tabular models. Finally, we discuss the challenges of SSL4NS-TD and provide potential directions for future research. We expect our work to encourage more research on lowering the barrier to entry for SSL in the tabular domain and on improving the foundations for implicit tabular data.

Paper Structure (26 sections, 6 equations, 1 figure, 3 tables)

Introduction
Overview
Problem Definition of SSL4NS-TD
Taxonomy
Predictive Learning of SSL4NS-TD
Learning from Masked Features
Perturbation in Latent Space
Inherent in Pre-Trained Language Models
Contrastive Learning of SSL4NS-TD
Hybrid Learning of SSL4NS-TD
Perturbation + Contrastive Learning
Masking + Contrastive Learning
Tackling Application Issues of SSL4NS-TD
Automatic Data Engineering
Cross-Table Transferability
...and 11 more sections

Figures (1)

Figure 1: Overall pipeline of SSL4NS-TD. Given tabular data, the SSL4NS-TD approaches adopt predictive learning ($\S$3), contrastive learning ($\S$4), or hybrid learning ($\S$5) as the self-supervised objective before supervised fine-tuning on the downstream applications. Then, the trained model is evaluated based on the demand-related benchmarks ($\S$7), which are framed as classification or regression problems.
