On the Sentence Embeddings from Pre-trained Language Models

Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, Lei Li

TL;DR

The paper tackles the poor semantic utility of BERT sentence embeddings by showing that the embedding space is anisotropic and non-smooth. It proposes BERT-flow, an unsupervised, flow-based calibration that maps BERT sentence embeddings to a smooth Gaussian latent space through an invertible transformation inspired by Glow. Results on seven STS benchmarks show significant gains: state-of-the-art performance when combined with NLI supervision, and strong improvements even without supervision. The work also shows that flow-based similarity is less biased toward lexical overlap and validates the method on unsupervised QNLI entailment tasks, highlighting practical benefits for semantic similarity applications.

Abstract

Pre-trained contextual representations like BERT have achieved great success in natural language processing. However, the sentence embeddings from pre-trained language models without fine-tuning have been found to capture the semantic meaning of sentences poorly. In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited. We first reveal the theoretical connection between the masked language model pre-training objective and the semantic similarity task, and then analyze the BERT sentence embeddings empirically. We find that BERT always induces a non-smooth anisotropic semantic space of sentences, which harms its performance on semantic similarity. To address this issue, we propose to transform the anisotropic sentence embedding distribution into a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective. Experimental results show that our proposed BERT-flow method obtains significant performance gains over the state-of-the-art sentence embeddings on a variety of semantic textual similarity tasks. The code is available at https://github.com/bohanli/BERT-flow.
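To make the idea of mapping an anisotropic embedding distribution to an isotropic Gaussian concrete, the following is a minimal sketch and not the paper's implementation. It assumes a single invertible affine map instead of a Glow-style stack of flow layers, and it uses synthetic vectors in place of real BERT sentence embeddings. For an affine map with a standard Gaussian base, the maximum-likelihood fit has a closed form (whitening), so no gradient training is needed here. All function names are hypothetical.

```python
import numpy as np

def fit_affine_flow(u):
    """Fit the invertible map z = L^{-1}(u - mu) by maximum likelihood
    under a standard Gaussian base distribution (closed form: whitening)."""
    mu = u.mean(axis=0)
    cov = np.cov(u, rowvar=False, bias=True)
    cov += 1e-6 * np.eye(u.shape[1])  # small jitter for numerical stability
    L = np.linalg.cholesky(cov)       # cov = L @ L.T, L lower triangular
    return mu, L

def to_latent(u, mu, L):
    """Map observed vectors u to latent vectors z."""
    return np.linalg.solve(L, (u - mu).T).T

def log_likelihood(u, mu, L):
    """Change of variables: log p(u) = log N(z; 0, I) + log|det dz/du|."""
    z = to_latent(u, mu, L)
    d = u.shape[1]
    log_base = -0.5 * np.sum(z ** 2, axis=1) - 0.5 * d * np.log(2 * np.pi)
    log_det = -np.sum(np.log(np.diag(L)))  # dz/du = L^{-1}
    return log_base + log_det

# Synthetic stand-in for sentence embeddings: an offset mean and very
# different variances per dimension, i.e. an anisotropic distribution.
rng = np.random.default_rng(0)
scales = np.array([10.0, 1.0, 0.1, 0.1])
u = rng.normal(size=(2000, 4)) * scales + 5.0

mu, L = fit_affine_flow(u)
z = to_latent(u, mu, L)
latent_cov = np.cov(z, rowvar=False, bias=True)  # close to the identity
avg_ll = log_likelihood(u, mu, L).mean()
```

An affine map can only fix the mean and covariance; a deeper flow such as the Glow-inspired one named in the TL;DR can also reshape non-Gaussian structure, which this sketch does not attempt.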

Paper Structure

This paper contains 31 sections, 7 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: An illustration of our proposed flow-based calibration over the original sentence embedding space of BERT.
  • Figure 2: A scatterplot of sentence pairs, where the horizontal axis represents similarity (either gold standard semantic similarity or embedding-induced similarity) and the vertical axis represents edit distance. Sentence pairs with edit distance $\leq 4$ are highlighted in green, while the remaining pairs are colored blue. We can observe that lexically similar sentence pairs tend to be predicted as similar by BERT embeddings, especially the green pairs. Such a correlation is less evident for gold standard labels or flow-induced embeddings.
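
The two axes of Figure 2 can be illustrated with a short sketch. It assumes word-level Levenshtein distance for the edit distance and cosine similarity for the embedding-induced similarity; the caption does not state the tokenization used, so both choices, the example sentences, and the function names are illustrative assumptions.

```python
import numpy as np

def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical sentence pair, tokenized by whitespace (an assumption).
s1 = "a man is playing a guitar".split()
s2 = "a man is playing a flute".split()
dist = edit_distance(s1, s2)
is_green = dist <= 4  # the threshold used for the highlighted pairs
```

Plotting `dist` against the cosine similarity of each pair's embeddings, over many pairs, gives a scatterplot of the kind the caption describes.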