DefSent+: Improving sentence embeddings of language models by projecting definition sentences into a quasi-isotropic or isotropic vector space of unlimited dictionary entries

Xiaodong Liu

TL;DR

We propose a novel method that progressively builds entry embeddings free from the limitations of language models' word embeddings, so that definition sentences can be projected into a quasi-isotropic or isotropic vector space of unlimited dictionary entries.

Abstract

This paper presents a significant improvement on the previous conference paper known as DefSent. The prior study seeks to improve sentence embeddings of language models by projecting definition sentences into the vector space of dictionary entries. We discover that this approach is not fully explored due to the methodological limitation of using word embeddings of language models to represent dictionary entries. This leads to two hindrances. First, dictionary entries are constrained by the single-word vocabulary, and thus cannot be fully exploited. Second, semantic representations of language models are known to be anisotropic, but pre-processing word embeddings for DefSent is not allowed because their weights are frozen during training and tied to the prediction layer. In this paper, we propose a novel method to progressively build entry embeddings not subject to these limitations. As a result, definition sentences can be projected into a quasi-isotropic or isotropic vector space of unlimited dictionary entries, so that sentence embeddings of noticeably better quality are attainable. We abbreviate our approach as DefSent+ (a plus version of DefSent), which has the following strengths: 1) the task performance on measuring sentence similarities is significantly improved compared to DefSent; 2) when DefSent+ is used to further train data-augmented models like SimCSE, SNCSE, and SynCSE, state-of-the-art performance on measuring sentence similarities can be achieved among the approaches without using manually labeled datasets; 3) DefSent+ is also competitive in feature-based transfer for NLP downstream tasks.
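To make the idea of "projecting definition sentences into the vector space of dictionary entries" concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. It assumes the objective can be read as scoring a definition-sentence embedding against a matrix of entry embeddings and applying a softmax cross-entropy loss on the correct entry. The function name `entry_prediction_loss` and the toy vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

def entry_prediction_loss(sentence_emb, entry_embs, target_idx):
    """Cross-entropy of predicting the target entry from a sentence embedding.

    Hypothetical helper: scores are dot products between the sentence
    embedding and every row of the entry-embedding matrix.
    """
    logits = entry_embs @ sentence_emb           # shape: (num_entries,)
    logits = logits - logits.max()               # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_idx]

# Toy setup (assumption): four entries with orthonormal embeddings.
entry_embs = np.eye(4)

# A definition embedding aligned with entry 0, and one aligned with entry 1.
close = np.array([5.0, 0.0, 0.0, 0.0])
far = np.array([0.0, 5.0, 0.0, 0.0])

loss_close = entry_prediction_loss(close, entry_embs, target_idx=0)
loss_far = entry_prediction_loss(far, entry_embs, target_idx=0)
print(loss_close < loss_far)
```

In this reading, lowering the loss pulls a definition's sentence embedding toward the embedding of the entry it defines. Because the entry matrix is an arbitrary array rather than a model's word-embedding table, it is not tied to a single-word vocabulary.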

Paper Structure (18 sections, 14 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Progressive Separate Training (PST) shown by the example of BERT-base-uncased. To echo the gist of our paper, we use a 2-d visualization of the vector space to illustrate entry embeddings, where entries are distributed along the two semantic axes with the largest variances. In this example, the pooling used to encode sentence embeddings during training is CLS and entry embeddings are $AMP$, and the total number of PST steps is three, but these settings can differ for other models. The dictionary dataset used is $a$ in Table \ref{tab:datasets}.
  • Figure 2: The PCA visualizations (with whitening) that correspond to the SVD Visualizations.
  • Figure 4: Simplified PST streamlines for four raw pre-trained models based on Dataset $a$. The circles indicate the most effective vector spaces for training encoders, and corresponding results are shown in Table \ref{tab:raw-models}.
  • Figure 5: The PCA visualizations (with whitening) that correspond to the SVD Visualizations for BERT-base-uncased and BERT-large-uncased. The black dotted circles highlight ideal spheres while the purple dotted circles highlight the entries protruding from the centered sphere.
  • Figure 6: The PCA visualizations (with whitening) that correspond to the SVD Visualizations for RoBERTa-base and RoBERTa-large. The black dotted circles highlight the ideal spheres.
  • ...and 2 more figures
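
Several captions above refer to PCA visualizations with whitening. As a rough illustration of what whitening does to an anisotropic embedding space, here is a minimal NumPy sketch. It is an assumption-laden example on synthetic data, not the paper's PST procedure: the helper name `whiten`, the random data, and the per-axis scales are all hypothetical.

```python
import numpy as np

def whiten(embeddings, eps=1e-8):
    """PCA-whiten the rows of `embeddings` so their covariance is ~identity."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # SVD of the centered data: centered = U @ diag(S) @ Vt
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    n = embeddings.shape[0]
    # Rotate onto the principal axes, then rescale each axis to unit variance.
    return centered @ vt.T / (s / np.sqrt(n - 1) + eps)

rng = np.random.default_rng(0)
# Synthetic anisotropic "entry embeddings": variance differs sharply per axis.
scales = np.array([10.0, 5.0, 2.0, 1.0, 1.0, 0.5, 0.2, 0.1])
raw = rng.normal(size=(500, 8)) * scales

white = whiten(raw)
cov = np.cov(white, rowvar=False)
print(np.allclose(cov, np.eye(8), atol=1e-5))
```

After whitening, every principal direction carries the same variance, so a 2-d projection of the points fills a roughly circular region instead of an elongated one. This is the sense in which a space is called isotropic in this sketch.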