Cropping outperforms dropout as an augmentation strategy for self-supervised training of text embeddings

Rita González-Márquez, Philipp Berens, Dmitry Kobak

Abstract

Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via supervised contrastive fine-tuning. This fine-tuning strategy relies on an external notion of similarity and on annotated data to generate positive pairs. Here we study self-supervised fine-tuning and systematically compare the two most well-known augmentation strategies used for fine-tuning text embedding models. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of the resulting embeddings is substantially below that of the supervised state-of-the-art models, but for in-domain data, self-supervised fine-tuning can produce high-quality text embeddings after very short fine-tuning. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning, and that fine-tuning only those last layers is sufficient to reach similar embedding quality.
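As an illustration of the cropping augmentation named in the abstract, the following minimal sketch builds one positive pair by sampling two crops of `t` consecutive sentences from the same text. It is not taken from the paper: the naive period-based sentence splitter, the function names, the default `t=2`, and the choice to let the two crops overlap are all assumptions made for this example. A dropout-based approach would instead feed the identical text through the model twice with different dropout masks, so no text-level function like this is needed.

```python
import random


def split_sentences(text):
    # Naive splitter (assumption): break on periods and drop empty pieces.
    parts = [s.strip() for s in text.split(".")]
    return [s + "." for s in parts if s]


def sample_crop(sentences, t, rng):
    # Return t consecutive sentences starting at a random position.
    # Texts with at most t sentences are returned whole.
    if len(sentences) <= t:
        return " ".join(sentences)
    start = rng.randrange(len(sentences) - t + 1)
    return " ".join(sentences[start:start + t])


def cropping_positive_pair(text, t=2, seed=None):
    # Two independent crops of the same text form one positive pair.
    # The crops may overlap or coincide (assumption of this sketch).
    rng = random.Random(seed)
    sentences = split_sentences(text)
    return sample_crop(sentences, t, rng), sample_crop(sentences, t, rng)
```

Calling `cropping_positive_pair(abstract_text, t=2)` on a hypothetical `abstract_text` yields two strings that can be embedded and treated as a positive pair, while crops from other texts in the minibatch serve as negatives.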

Paper Structure

This paper contains 38 sections, 5 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: MTEB score during fine-tuning. Improvement in the block mean MTEB score and in individual task modalities within the first fine-tuning epoch on the ICLR dataset.
  • Figure 2: Dataset visualizations. $t$-SNE visualizations of MPNet, SBERT, and cropping-based fine-tuned MPNet embeddings of different datasets. Color corresponds to class labels. Numbers show $k$NN accuracy in 768D embedding space. We used openTSNE \cite{policar2024opentsne} with default parameters. The fine-tuned model was always fine-tuned on the same dataset, as in Table \ref{tab:mteb_datasets}.
  • Figure 3: Sentence vs. domain adaptation. $k$NN accuracy on the ICLR dataset for MPNet fine-tuned separately on four different datasets (arXiv, bioRxiv, Reddit, ICLR).
  • Figure 4: Representation quality across layers. (a) $k$NN accuracy after fine-tuning MPNet with different numbers of initial layers frozen. The embedding layer was frozen in all settings. Zero unfrozen layers corresponds to no fine-tuning. (b) $k$NN accuracy after each layer for MPNet before and after fine-tuning all layers, for SBERT, and for a randomly initialized model with and without training with cropping augmentations. Layer 0 corresponds to the embedding layer. (c) MTEB block average score after each layer for MPNet before and after fine-tuning with both kinds of augmentations, and for SBERT. Layer 0 corresponds to the embedding layer. Here the text embeddings were not normalized for the evaluation, unlike in Section \ref{sec:mteb_tasks}, so the exact values differ slightly from those in Table \ref{tab:mteb_tasks}.
  • Figure S1: Hyperparameter tuning. $k$NN accuracies on the ICLR dataset used for self-supervised training and evaluation, as a function of different hyperparameter values. (a) Temperature $\tau$ used to scale the similarities in the loss function. (b) Number of consecutive sentences $t$ used in the cropping augmentation. The minibatch size $b$ was adapted depending on $t$ to make it fit into our GPU memory: we used $b=128$ for $t=1$; $b=64$ for $t=2,3,4$; $b=32$ for $t=5,6,7,8,9$; and $b=16$ for $t=10$. (c) Fraction of masked tokens used in addition to the cropping augmentation. (d) Learning rate $\eta$ used by the Adam optimizer.
  • ...and 2 more figures
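
The caption of Figure S1(a) refers to a temperature $\tau$ that scales the similarities in the loss function. The sketch below shows how such a temperature typically enters a contrastive loss, using a generic one-directional InfoNCE formulation with cosine similarity. This is an assumption for illustration only: the paper's own loss is not reproduced here and may differ, for example in which negatives it includes, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    # Cosine similarity between two non-zero vectors given as lists of floats.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, tau):
    # anchors[i] and positives[i] embed two views of the same text.
    # All other positives[j] in the minibatch act as negatives for anchors[i].
    losses = []
    for i, u in enumerate(anchors):
        # Similarities are divided by tau before the softmax.
        logits = [cosine(u, v) / tau for v in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

A smaller `tau` sharpens the softmax over the minibatch, so differences between the positive and the negatives are penalized more strongly; this is why the temperature is usually tuned together with the minibatch size.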