SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption

Dara Bahri, Heinrich Jiang, Yi Tay, Donald Metzler

TL;DR

SCARF is a simple, domain-agnostic self-supervised pre-training scheme for tabular data. It builds a view of each example by replacing a random subset of its features with values drawn from those features' empirical marginal distributions, then optimizes an InfoNCE objective over an encoder and a pre-training head. After pre-training, a classifier is trained on top of the encoder. Across 69 OpenML-CC18 datasets, this improves downstream accuracy in the fully supervised, semi-supervised, and label-noise settings. Ablation studies show that SCARF's view construction is robust to scaling, batch size, corruption rate, and temperature, and that its gains persist when it is combined with other techniques. The approach offers a practical, scalable boost for tabular representation learning, with potential for further bias-mitigating extensions.

Abstract

Self-supervised contrastive representation learning has proved incredibly successful in the vision and natural language domains, enabling state-of-the-art performance with orders of magnitude less labeled data. However, such methods are domain-specific and little has been done to leverage this technique on real-world tabular datasets. We propose SCARF, a simple, widely-applicable technique for contrastive learning, where views are formed by corrupting a random subset of features. When applied to pre-train deep neural networks on the 69 real-world, tabular classification datasets from the OpenML-CC18 benchmark, SCARF not only improves classification accuracy in the fully-supervised setting but does so also in the presence of label noise and in the semi-supervised setting where only a fraction of the available training data is labeled. We show that SCARF complements existing strategies and outperforms alternatives like autoencoders. We conduct comprehensive ablations, detailing the importance of a range of factors.
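The view construction described above (corrupt a random subset of features, drawing replacements from each feature's empirical marginal) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `corrupt_features`, the choice to corrupt the same number of features in every row, and the default `corruption_rate` are all hypothetical choices made for the example.

```python
import numpy as np

def corrupt_features(batch, reference, corruption_rate=0.6, rng=None):
    """Return (corrupted_view, mask) for a 2-D array `batch`.

    Assumptions (not taken from this page): every row has the same number of
    corrupted features, and `corruption_rate` is an illustrative default.
    `reference` is the data whose columns define the empirical marginals.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_rows, n_features = batch.shape
    n_corrupt = int(corruption_rate * n_features)

    # Each row of argsort(...) is a random permutation of 0..n_features-1,
    # so exactly n_corrupt entries per row are below n_corrupt.
    mask = rng.random((n_rows, n_features)).argsort(axis=1) < n_corrupt

    # Sampling a feature's empirical marginal = taking that column's value
    # from a uniformly chosen row of `reference`, independently per entry.
    donor_rows = rng.integers(0, reference.shape[0], size=(n_rows, n_features))
    replacements = reference[donor_rows, np.arange(n_features)]

    # Keep original values where mask is False, replacements where True.
    return np.where(mask, replacements, batch), mask
```

Because each replacement is drawn per feature and independently of the other features, the corrupted view keeps every column's marginal distribution but breaks dependencies between the corrupted and uncorrupted features. The sketch needs no domain-specific augmentation, which is what makes the idea applicable to arbitrary tabular data.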

Paper Structure

This paper contains 16 sections, 1 equation, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: Diagram showing unsupervised Scarf pre-training (Top) and subsequent supervised fine-tuning (Bottom). During pre-training, networks $f$ and $g$ are learned to produce good representations of the input data. After pre-training, $g$ is discarded and a classification head $h$ is applied on top of the learned $f$, and both $f$ and $h$ are subsequently fine-tuned for classification.
  • Figure 2: Top: Win matrices comparing pre-training methods against each other, and their improvement to existing solutions. Bottom: Box plots showing the relative improvement of different pre-training methods over baselines (y-axis is zoomed in). We see that Scarf pre-training adds value even when used in conjunction with known techniques.
  • Figure 3: Scarf boosts baseline performance even when $30\%$ of the training labels are corrupted. Notably, it improves state-of-the-art label noise solutions like Deep $k$-NN.
  • Figure 4: Scarf shows even more significant gain in the semi-supervised setting where $25\%$ of the data is labeled and the remaining $75\%$ is not. Strikingly, pre-training with Scarf boosts the performance of self-training and tri-training by several percent.
  • Figure 5: Win matrix for various batch sizes (Left) and corruption rates (Right) for the fully labeled, noiseless setting.
  • ...and 7 more figures
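
The pre-training stage in Figure 1 trains $f$ and $g$ with an InfoNCE objective: the embedding of an example and the embedding of its corrupted view form a positive pair, and the other examples in the batch serve as negatives. A rough NumPy sketch of such a loss follows. It is an illustration under assumptions, not the paper's code: the function name `info_nce_loss`, the use of cosine similarity, the default `temperature`, and the one-directional form (anchors scored against views only) are choices made for this example.

```python
import numpy as np

def info_nce_loss(z_anchor, z_view, temperature=1.0):
    """InfoNCE over a batch of paired embeddings, each of shape (N, d).

    Row i of `z_anchor` and row i of `z_view` are a positive pair; every
    other row of `z_view` acts as a negative for anchor i.
    """
    # L2-normalize so that dot products are cosine similarities.
    z_anchor = z_anchor / np.linalg.norm(z_anchor, axis=1, keepdims=True)
    z_view = z_view / np.linalg.norm(z_view, axis=1, keepdims=True)

    # (N, N) similarity matrix, scaled by the temperature.
    logits = z_anchor @ z_view.T / temperature

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives sit on the diagonal; average their negative log-probability.
    return -np.mean(np.diag(log_probs))
```

In a full pipeline, `z_anchor` would be $g(f(x))$ for the original batch and `z_view` would be $g(f(\tilde{x}))$ for its corrupted counterpart. After pre-training, only $f$ is kept and the classification head $h$ is attached for fine-tuning, as the Figure 1 caption describes.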