Attention versus Contrastive Learning of Tabular Data -- A Data-centric Benchmarking

Shourav B. Rabbani; Ivan V. Medri; Manar D. Samad

TL;DR

This work tackles the persistent underperformance of deep learning on tabular data by conducting a large-scale, data-centric benchmark of attention-based and contrastive learning methods against traditional baselines across 28 diverse tabular datasets. It reveals that no single method dominates all datasets; attention-based models excel on hard, high-curvature tasks, while contrastive learning shines in high-dimensional settings, with hybrid approaches like SAINT delivering robust results. The study emphasizes dataset-specific model selection, shows where combinations of attention and contrastive strategies help, and discusses practical limitations such as memory constraints. Overall, it provides actionable guidance for selecting and developing tabular data methods, highlighting the value of combining attention and contrastive learning to push beyond traditional ML, especially in challenging, high-dimensional domains.

Abstract

Despite groundbreaking success in image and text learning, deep learning has not achieved significant improvements against traditional machine learning (ML) when it comes to tabular data. This performance gap underscores the need for data-centric treatment and benchmarking of learning algorithms. Recently, attention and contrastive learning breakthroughs have shifted computer vision and natural language processing paradigms. However, the effectiveness of these advanced deep models on tabular data is sparsely studied using a few data sets with very large sample sizes, reporting mixed findings after benchmarking against a limited number of baselines. We argue that the heterogeneity of tabular data sets and selective baselines in the literature can bias the benchmarking outcomes. This article extensively evaluates state-of-the-art attention and contrastive learning methods on a wide selection of 28 tabular data sets (14 easy and 14 hard-to-classify) against traditional deep and machine learning. Our data-centric benchmarking demonstrates when traditional ML is preferred over deep learning and vice versa because no best learning method exists for all tabular data sets. Combining between-sample and between-feature attentions conquers the invincible traditional ML on tabular data sets by a significant margin but fails on high dimensional data, where contrastive learning takes a robust lead. While a hybrid attention-contrastive learning strategy mostly wins on hard-to-classify data sets, traditional methods are frequently superior on easy-to-classify data sets with presumably simpler decision boundaries. To the best of our knowledge, this is the first benchmarking paper with statistical analyses of attention and contrastive learning performances on a diverse selection of tabular data sets against traditional deep and machine learning baselines to facilitate further advances in this field.

Paper Structure (31 sections, 1 equation, 4 figures, 9 tables)

Introduction
Background
Preliminaries
Rationale for deep learning of tabular data
Literature review
Methods
Tabular datasets
Baseline neural networks
DNN
Pretraining via an autoencoder
Attention-based learning
TabNet (Arik 2021)
FT-Transformer (Gorishniy 2021)
NPT (Kossen 2021)
SAINT (Somepalli 2022)
...and 16 more sections

Figures (4)

Figure 1: Between-feature and between-sample attention with a classifier head. Here, MHSA is multi-headed self-attention (Vaswani 2017). $S$ is the feature embedding, including the CLS token (Gorishniy 2021; Kossen 2021; Somepalli 2022). Between-feature attention improves $S$ to $S'$, and between-sample attention further improves it to $S''$. The final embeddings of the CLS tokens are fed into a classifier head to generate class logits.
Figure 2: Formation of positive and negative pairs in SimCLR, SCARF, and our proposed method. Here, $x_i$ and $x_j$ denote two different samples ($i \neq j$), and $\widehat{x}_i$, $\widehat{x}_j$ are the corrupted versions of the original samples.
Figure 3: Win ratio of the row method against the column method. An x/y ratio indicates that the row method is statistically superior to the column method on x of the y datasets on which the difference between the two methods is statistically significant.
Figure 4: Percentage F1 score difference between methods. Negative percentages indicate method A outperforms method B in an A versus B comparison. For within-comparison cases, the difference is between the best and worst methods. Hence, the difference is always positive. Full circle markers represent hard datasets, and cross markers represent easy datasets.
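
The two attention directions named in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, random projection weights, no residual connections or layer normalization, and a CLS token stored at position 0. The function names and the flatten-then-attend form of between-sample attention are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product self-attention over the
    # second-to-last axis of x (a simplification of MHSA).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def between_feature_attention(S, params):
    # S: (n_samples, n_tokens, emb). Tokens (CLS + features) attend
    # to each other within each sample; samples do not interact.
    return self_attention(S, *params)

def between_sample_attention(S, params):
    # Flatten each sample into one vector so that samples in the
    # batch attend to each other, then restore the token layout.
    n, t, e = S.shape
    flat = S.reshape(n, t * e)
    return self_attention(flat, *params).reshape(n, t, e)

rng = np.random.default_rng(0)
n, t, e, n_classes = 8, 5, 4, 3   # hypothetical sizes: 8 samples, CLS + 4 features
S = rng.normal(size=(n, t, e))    # stand-in for the feature embedding S

# Random projection weights; a trained model would learn these.
feat_params = [rng.normal(size=(e, e)) * 0.1 for _ in range(3)]
samp_params = [rng.normal(size=(t * e, t * e)) * 0.1 for _ in range(3)]

S1 = between_feature_attention(S, feat_params)    # S  -> S'
S2 = between_sample_attention(S1, samp_params)    # S' -> S''

cls_embeddings = S2[:, 0, :]                      # assumes CLS is token 0
w_head = rng.normal(size=(e, n_classes))          # hypothetical linear classifier head
logits = cls_embeddings @ w_head
print(S1.shape, S2.shape, logits.shape)
```

In this sketch, changing one sample leaves the between-feature output of every other sample unchanged, whereas the between-sample output of every sample in the batch can change. That batch dependence is the practical difference between the two attention directions.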
