Unsupervised hard Negative Augmentation for contrastive learning

Yuxuan Shu, Vasileios Lampos

TL;DR

This work targets the underexplored area of negative data augmentation in self-supervised contrastive learning for NLP. It introduces Unsupervised hard Negative Augmentation (UNA), a TF-IDF-guided method that creates hard negatives by selectively replacing informative terms in a sentence with terms of comparable importance. Applied during pre-training and compatible with paraphrasing, UNA yields consistent improvements on semantic textual similarity (STS) tasks across BERT- and RoBERTa-based backbones, with particularly strong gains when combined with paraphrasing. The approach offers a simple, efficient augmentation strategy that strengthens contrastive learning signals and improves downstream semantic similarity performance. Two limitations are left for future work: the experiments cover English only, and TF-IDF is used at the sentence level.

Abstract

We present Unsupervised hard Negative Augmentation (UNA), a method that generates synthetic negative instances based on the term frequency-inverse document frequency (TF-IDF) retrieval model. UNA uses TF-IDF scores to ascertain the perceived importance of terms in a sentence and then produces negative samples by replacing terms according to those scores. Our experiments demonstrate that models trained with UNA achieve better overall performance on semantic textual similarity tasks. Additional performance gains are obtained when combining UNA with paraphrasing augmentation. Further results show that our method is compatible with different backbone models. Ablation studies also support the choice of having a TF-IDF-driven control on negative augmentation.
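
To make the idea of TF-IDF-driven replacement concrete, the following is a minimal sketch and not the authors' implementation. It treats each sentence as a document, scores tokens by TF-IDF, and swaps the highest-scoring tokens for vocabulary terms with the closest IDF value. The function names (`idf_table`, `hard_negative`), the whitespace tokenizer, the top-k selection rule, the `replace_ratio` parameter, and the nearest-IDF matching are all illustrative assumptions; the summary above only states that informative terms are replaced by terms of comparable importance.

```python
import math
import random
from collections import Counter


def idf_table(corpus):
    """Smoothed IDF per term, treating each sentence as one document (assumption)."""
    n_docs = len(corpus)
    doc_freq = Counter(term for sent in corpus for term in set(sent.split()))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def hard_negative(sentence, idf, rng, replace_ratio=0.3):
    """Hypothetical helper: replace the most informative tokens of a sentence.

    Every token in `sentence` must appear in `idf`.
    """
    tokens = sentence.split()
    term_freq = Counter(tokens)
    # TF-IDF score of each token position within this sentence.
    scores = [term_freq[t] / len(tokens) * idf[t] for t in tokens]
    k = max(1, round(replace_ratio * len(tokens)))
    # Positions of the k highest-scoring tokens (ties keep sentence order).
    targets = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # Replacement candidates: vocabulary terms that do not occur in the sentence.
    candidates = sorted(t for t in idf if t not in term_freq)
    out = list(tokens)
    for i in targets:
        if not candidates:
            break
        # Keep the candidates whose IDF is closest to the removed term's IDF.
        gaps = {c: abs(idf[c] - idf[tokens[i]]) for c in candidates}
        best = min(gaps.values())
        pool = [c for c in candidates if gaps[c] == best]
        out[i] = rng.choice(pool)
    return " ".join(out)


corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a bird flew over the house",
]
idf = idf_table(corpus)
negative = hard_negative(corpus[0], idf, random.Random(0))
print(negative)
```

In this toy corpus the two rarest tokens of the first sentence ("cat" and "mat") are replaced, while frequent tokens such as "the", "sat", and "on" are kept, so the negative shares surface structure with the original but differs in its most informative terms.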

Paper Structure (33 sections, 4 equations, 1 figure, 10 tables)