SDA: Simple Discrete Augmentation for Contrastive Sentence Representation Learning

Dongsheng Zhu, Zhenyu Mao, Jinghui Lu, Rui Zhao, Fei Tan

TL;DR

Three simple yet effective discrete sentence augmentation schemes are developed that act as minimal noise at the lexical level to produce diverse forms of a sentence. In addition, standard negation is used to generate negative samples, alleviating the feature suppression involved in contrastive learning.

Abstract

Contrastive learning has recently achieved compelling performance in unsupervised sentence representation. Data augmentation protocols, although an essential element, have not been well explored. The pioneering work SimCSE, which resorts to a simple dropout mechanism (viewed as continuous augmentation), is reported to surprisingly outperform discrete augmentations such as cropping, word deletion, and synonym replacement. To understand the underlying rationale, we revisit existing approaches and hypothesize the desiderata of a reasonable data augmentation method: a balance of semantic consistency and expression diversity. We then develop three simple yet effective discrete sentence augmentation schemes: punctuation insertion, modal verbs, and double negation. They act as minimal noise at the lexical level to produce diverse forms of sentences. Furthermore, standard negation is used to generate negative samples, alleviating the feature suppression involved in contrastive learning. We experiment extensively with semantic textual similarity on diverse datasets, and the results consistently support the superiority of the proposed methods. Our key code is available at https://github.com/Zhudongsheng75/SDA
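To make the four operations named in the abstract concrete, the toy sketch below applies each of them to a simple copular sentence. It is only an illustration under strong assumptions: the function names, the default punctuation mark and modal verb, and the string-replacement rules on " is " are all hypothetical and are not taken from the paper or its repository, which may implement these augmentations quite differently.

```python
def insert_punctuation(sentence: str, mark: str = ",", index: int = 1) -> str:
    """Attach `mark` to the token at `index` (0-based).

    Hypothetical stand-in for punctuation insertion: a small lexical
    perturbation that leaves the words themselves untouched.
    """
    tokens = sentence.split()
    if not tokens:
        return sentence
    index = max(0, min(index, len(tokens) - 1))  # clamp to a valid token
    tokens[index] += mark
    return " ".join(tokens)


def add_modal(sentence: str, modal: str = "must") -> str:
    """Rewrite the first ' is ' as ' <modal> be ' (toy rule, assumption)."""
    return sentence.replace(" is ", f" {modal} be ", 1)


def negate(sentence: str) -> str:
    """Standard negation: yields a semantically opposite, hard negative sample."""
    return sentence.replace(" is ", " is not ", 1)


def double_negate(sentence: str) -> str:
    """Double negation: negate twice so the overall meaning is preserved."""
    negated = negate(sentence)
    if negated == sentence:  # toy rule did not apply; leave the sentence alone
        return sentence
    # Lowercasing the first character is naive and would break proper nouns.
    return "It is not true that " + negated[0].lower() + negated[1:]


anchor = "The cat is asleep."
positives = [insert_punctuation(anchor), add_modal(anchor), double_negate(anchor)]
hard_negative = negate(anchor)
```

In this sketch the three positives stay close to the anchor in meaning while differing in surface form, whereas the negated sentence differs by a single word yet reverses the meaning, which is the property the abstract relies on for negative samples.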

Paper Structure (29 sections, 4 equations, 5 figures, 9 tables, 3 algorithms)

Figures (5)

  • Figure 1: Normalized representation visualization of different augmentation methods and the way they should be optimized.
  • Figure 2: An overview of the framework. The figure can be read as one training batch. Each sentence is passed through the augmentation module to generate one positive and one negative for the anchor; the positives generated from the other sentences in the batch are also treated as negatives for the anchor.
  • Figure 3: The syntax tree constructed through dependency parsing and its representation.
  • Figure 4: Cases from the SICK-R test set. The heatmap visualizes the weight of each word in the sentence representation. Sentence pairs are ranked by similarity score in ascending order; a better ranking is closer to the ground truth (GT).
  • Figure 5: Parameter sensitivity for (a) proportion of augmented sentence pairs in training and (b) margin $\delta$.
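
The batch construction described in the Figure 2 caption can be sketched as a softmax-style contrastive loss: for each anchor, the candidates are every positive in the batch plus the anchor's own generated negative, and only its own positive counts as correct. This is a minimal sketch under assumptions, not the paper's objective: the cosine similarity, the `temperature` value, and the function names are hypothetical, and the margin $\delta$ from Figure 5 is left out because this section does not say how it enters the loss.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def batch_contrastive_loss(anchors, positives, negatives, temperature=0.05):
    """Mean cross-entropy over a batch laid out as in the Figure 2 caption.

    For anchor i the candidates are all positives in the batch (the ones
    from other sentences act as negatives) plus its own generated negative.
    `temperature` is an assumed hyperparameter, not a value from the paper.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        logits.append(cosine(anchor, negatives[i]) / temperature)
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_norm - logits[i])  # -log softmax of the true positive
    return sum(losses) / len(losses)


# Toy 2-D embeddings: each anchor matches its own positive exactly.
anchors = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]
aligned = batch_contrastive_loss(anchors, [[1.0, 0.0], [0.0, 1.0]], negatives)
swapped = batch_contrastive_loss(anchors, [[0.0, 1.0], [1.0, 0.0]], negatives)
```

With these toy vectors the loss is near zero when each anchor is aligned with its own positive and large when the positives are swapped, which is the behaviour the batch layout is meant to reward.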