Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks

Nandan Thakur, Nils Reimers, Johannes Daxenberger, Iryna Gurevych

TL;DR

AugSBERT is a practical data augmentation method that uses a cross-encoder to assign soft labels to additional sentence pairs, which are then used to train SBERT-style bi-encoders for pairwise sentence scoring. With careful pair sampling (BM25 and related strategies work well) and tuning of random seeds, AugSBERT delivers consistent in-domain gains (1–6 points) and substantial domain-adaptation improvements (up to 37 points), while keeping the efficient indexing that bi-encoders allow. The approach narrows the performance gap between bi-encoders and cross-encoders, offering a scalable, data-efficient path to stronger sentence-similarity and paraphrase models across languages and domains.

Abstract

There are two approaches for pairwise sentence scoring: Cross-encoders, which perform full-attention over the input pair, and Bi-encoders, which map each input independently to a dense vector space. While cross-encoders often achieve higher performance, they are too slow for many practical use cases. Bi-encoders, on the other hand, require substantial training data and fine-tuning over the target task to achieve competitive performance. We present a simple yet efficient data augmentation strategy called Augmented SBERT, where we use the cross-encoder to label a larger set of input pairs to augment the training data for the bi-encoder. We show that, in this process, selecting the sentence pairs is non-trivial and crucial for the success of the method. We evaluate our approach on multiple tasks (in-domain) as well as on a domain adaptation task. Augmented SBERT achieves an improvement of up to 6 points for in-domain and of up to 37 points for domain adaptation tasks compared to the original bi-encoder performance.
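The augmentation loop described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`bm25_scores`, `sample_pairs_bm25`, `stub_cross_encoder`, `augment`) are hypothetical, the BM25 scorer is a small in-memory variant rather than a search engine, and the "cross-encoder" is replaced by a token-overlap placeholder so the sketch runs without any model. In practice the scoring function would be a cross-encoder fine-tuned on the gold data.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25 (non-negative IDF variant)."""
    n_docs = len(docs_tokens)
    avg_len = (sum(len(d) for d in docs_tokens) / n_docs) or 1.0
    df = Counter()  # number of documents containing each term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def sample_pairs_bm25(sentences, top_k=2, gold_pairs=()):
    """Pair each sentence with its top_k BM25 neighbours, skipping pairs already labeled."""
    tokens = [tokenize(s) for s in sentences]
    seen = {frozenset(p) for p in gold_pairs}
    pairs = []
    for i, query in enumerate(tokens):
        scores = bm25_scores(query, tokens)
        others = [j for j in range(len(sentences)) if j != i]
        ranked = sorted(others, key=lambda j: scores[j], reverse=True)
        for j in ranked[:top_k]:
            key = frozenset((sentences[i], sentences[j]))
            if key not in seen:
                seen.add(key)
                pairs.append((sentences[i], sentences[j]))
    return pairs


def stub_cross_encoder(a, b):
    """ASSUMPTION: placeholder for a trained cross-encoder; returns Jaccard overlap in [0, 1]."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def augment(gold, sentences, score_fn=stub_cross_encoder, top_k=2):
    """Return gold triples followed by silver triples (a, b, score) labeled by score_fn."""
    gold_pairs = [(a, b) for a, b, _ in gold]
    new_pairs = sample_pairs_bm25(sentences, top_k, gold_pairs)
    silver = [(a, b, score_fn(a, b)) for a, b in new_pairs]
    return list(gold) + silver


# Toy usage: one gold pair, four unlabeled sentences.
gold = [("a cat sits on the mat", "a cat is sitting on a mat", 0.9)]
sentences = [
    "a cat sits on the mat",
    "a cat is sitting on a mat",
    "a dog runs in the park",
    "the dog is running in a park",
]
train_set = augment(gold, sentences)
```

The combined `train_set` is what the bi-encoder would be trained on. Restricting candidates to top-ranked BM25 neighbours, rather than labeling all possible pairs, reflects the abstract's point that pair selection matters: most random pairs are dissimilar, so exhaustive pairing would skew the silver labels toward low scores.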

Paper Structure

This paper contains 26 sections, 1 equation, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Spearman rank correlation ($\rho$) test scores for different STS Benchmark (English) training sizes.
  • Figure 2: Augmented SBERT in-domain approach.
  • Figure 3: Domain adaptation with AugSBERT.
  • Figure 4: Comparison of the density distributions of gold standard with silver standard for various sampling techniques on Spanish-STS (in-domain) dataset.
  • Figure 5: Comparison of density distribution of BWS Argument Similarity dataset with Spanish-STS dataset.
  • ...and 2 more figures