Tabular Data Contrastive Learning via Class-Conditioned and Feature-Correlation Based Augmentation

Wei Cui, Rasa Hosseinzadeh, Junwei Ma, Tongzi Wu, Yi Sui, Keyvan Golestan

TL;DR

This work tackles the challenge of effective self-supervised pretraining for tabular data by introducing two augmentation strategies for contrastive learning under a semi-supervised setting: class-conditioned corruption, which samples replacement values from rows sharing the anchor's predicted class via pseudo-labeling, and correlation-based feature masking, which selects corrupted feature subsets using XGBoost-derived feature importance. Empirical results on OpenML-CC18 show that class-conditioned corruption consistently improves embedding quality and downstream classification performance over conventional random corruption, while correlation-based masking yields no clear gains on these datasets. The findings suggest that semantic alignment through class information benefits contrastive pretraining on tabular data, whereas simple correlation-based masking may require stronger or more varied feature correlations to be advantageous. The code is available at https://github.com/willtop/Tabular-Class-Conditioned-SSL, enabling reproducibility and further exploration of tabular augmentation strategies.

Abstract

Contrastive learning is a model pre-training technique that first creates similar views of the original data and then encourages the data and its corresponding views to be close in the embedding space. Contrastive learning has seen success on image and natural language data, thanks to domain-specific augmentation techniques that are both intuitive and effective. In the tabular domain, however, the predominant augmentation technique for creating views is to corrupt tabular entries by swapping values, which is not as sound or effective. We propose a simple yet powerful improvement to this augmentation technique: corrupting tabular data conditioned on class identity. Specifically, when corrupting a tabular entry in an anchor row, instead of sampling a value uniformly from the same feature column across the entire table, we sample only from rows identified as belonging to the same class as the anchor row. We assume the semi-supervised learning setting and adopt pseudo-labeling to obtain class identities for all table rows. We also explore the novel idea of selecting features to be corrupted based on feature correlation structures. Extensive experiments show that the proposed approach consistently outperforms the conventional corruption method on tabular data classification tasks. Our code is available at https://github.com/willtop/Tabular-Class-Conditioned-SSL.
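The class-conditioned corruption described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); the function name `class_conditioned_corrupt`, the `corruption_rate` parameter, and the choice to draw the corruption mask independently per entry are assumptions made for illustration. The pseudo-labels are taken as given, however they were obtained.

```python
import numpy as np

def class_conditioned_corrupt(X, pseudo_labels, corruption_rate=0.3, rng=None):
    """Build a corrupted view of X (hypothetical helper, for illustration only).

    Each corrupted entry is replaced by a value taken from the same feature
    column of a row that carries the same (pseudo-)label as the anchor row.
    """
    rng = np.random.default_rng(rng)
    n_rows, n_cols = X.shape
    X_view = X.copy()
    # Assumption: each entry is corrupted independently with this probability.
    mask = rng.random((n_rows, n_cols)) < corruption_rate
    for label in np.unique(pseudo_labels):
        rows = np.flatnonzero(pseudo_labels == label)
        # For every (row, feature) pair in this class, pick a donor row
        # from the same class, with replacement.
        donors = rng.choice(rows, size=(rows.size, n_cols))
        # candidates[i, j] = X[donors[i, j], j], so columns are never mixed.
        candidates = X[donors, np.arange(n_cols)]
        X_view[rows] = np.where(mask[rows], candidates, X[rows])
    return X_view, mask
```

Conventional random corruption corresponds to assigning every row the same label, so donors are drawn from the entire table. An oracle variant would pass the true labels in place of `pseudo_labels`.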

Paper Structure (18 sections, 5 equations, 3 figures, 5 tables, 4 algorithms)

Figures (3)

  • Figure 1: Contrastive Loss Learning Curves for Competing Augmentation Methods in the Pre-training Stage. Compared to the conventional tabular augmentation (corruption with randomly sampled values), our method achieves noticeably lower contrastive loss at a faster pace. Note that our method matches the loss optimization results by the oracle corruption method (which has the knowledge of all labels).
  • Figure 2: Learned Embeddings from Augmentation Methods. The embeddings are computed by the encoder pre-trained under each method, and are visualized in 3-D space through dimensionality reduction with PCA. Evidently, after pre-training under our approach, the separation between samples (both train and test samples) from different classes is more prominent in the learned embedding space compared to no pre-training or pre-training under the conventional augmentation approach.
  • Figure 3: Classification Accuracy Win Matrix among competing augmentation methods on How to Corrupt. Our method achieves better classification results compared to the conventional approach over a large portion of datasets.
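
A win matrix like the one in Figure 3 can be computed from per-dataset accuracies. The sketch below assumes a common definition, which may differ from the paper's exact one: entry (i, j) is the number of datasets on which method i is strictly more accurate than method j, divided by the number of datasets on which the two methods differ (ties excluded). The function name `win_matrix` and the numbers in the usage example are hypothetical and are not results from the paper.

```python
import numpy as np

def win_matrix(acc):
    """acc: array of shape (n_methods, n_datasets) holding accuracies.

    Returns W with W[i, j] = wins of i over j / (wins + losses).
    Entries with no decided comparison (including the diagonal) are NaN.
    """
    acc = np.asarray(acc, dtype=float)
    # wins[i, j] = number of datasets where method i strictly beats method j.
    wins = (acc[:, None, :] > acc[None, :, :]).sum(axis=2).astype(float)
    decided = wins + wins.T  # datasets where i and j are not tied
    W = np.full_like(wins, np.nan)
    np.divide(wins, decided, out=W, where=decided > 0)
    return W

# Hypothetical accuracies for two methods on four datasets.
example = [[0.9, 0.8, 0.7, 0.6],
           [0.8, 0.8, 0.9, 0.5]]
W = win_matrix(example)  # W[0, 1] is 2/3: two wins, one loss, one tie
```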