Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features

Matteo Pagliardini, Prakhar Gupta, Martin Jaggi

TL;DR

Problem: Learn robust, general-purpose sentence embeddings in an unsupervised setting at scale. Approach: Sent2Vec extends the C-BOW objective by composing each sentence representation as the average of its word and n-gram embeddings, and trains with negative sampling for efficiency. Contributions: a simple, fast, and scalable sentence embedding method whose training and inference cost per sentence is $O(|S|h)$ for unigrams ($O(|R(S)|h)$ when n-grams are included), with strong performance on unsupervised benchmarks and competitive results on supervised tasks. Impact: enables practical deployment of versatile sentence embeddings across NLP applications where supervision is limited and unlabeled data is plentiful.
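The composition step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary-based `table`, the helper names `extract_ngrams` and `sentence_embedding`, and the toy vectors are all assumptions made for illustration, and $|R(S)|$ is read here as the number of unigram and n-gram features found in the sentence, with $h$ as the embedding dimension. Training with negative sampling is not shown.

```python
import numpy as np

def extract_ngrams(tokens, n_max=2):
    """Return the unigrams followed by all contiguous n-grams up to n_max."""
    feats = list(tokens)
    for n in range(2, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats

def sentence_embedding(tokens, table, dim, n_max=2):
    """Average the embeddings of every known unigram and n-gram feature.

    `table` is a hypothetical dict mapping a feature string to a vector of
    length `dim`; features missing from the table are skipped.
    """
    feats = [f for f in extract_ngrams(tokens, n_max) if f in table]
    if not feats:
        return np.zeros(dim)
    # One h-dimensional lookup per feature, hence O(|R(S)| h) per sentence.
    return np.mean([table[f] for f in feats], axis=0)

# Toy usage with random vectors standing in for learnt embeddings.
rng = np.random.default_rng(0)
dim = 4
tokens = ["the", "cat", "sat"]
table = {f: rng.normal(size=dim) for f in extract_ngrams(tokens)}
embedding = sentence_embedding(tokens, table, dim)
```

With `n_max=1` the same function reduces to a plain average of word vectors, which corresponds to the $O(|S|h)$ unigram case.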

Abstract

The recent tremendous success of unsupervised word embeddings in a multitude of applications raises the obvious question if similar methods could be derived to improve embeddings (i.e. semantic representations) of word sequences as well. We present a simple but efficient unsupervised objective to train distributed representations of sentences. Our method outperforms the state-of-the-art unsupervised models on most benchmark tasks, highlighting the robustness of the produced general-purpose sentence embeddings.

Paper Structure

This paper contains 17 sections, 8 equations, 1 figure, 8 tables.

Figures (1)

  • Figure 1: Left: the profile of the word vector $L_2$-norms as a function of $\log(f_w)$ for each vocabulary word $w$, as learnt by our unigram model trained on the Toronto Books corpus. Right: the down-weighting scheme proposed by Arora et al. (2017): $\mathrm{weight}(w) = \frac{a}{a+f_w}$.
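
The down-weighting formula in the Figure 1 caption can be evaluated directly. The sketch below assumes that $f_w$ is the relative frequency of word $w$ in the corpus; the function name `down_weight` and the default `a = 1e-3` are illustrative choices, not values taken from this page.

```python
def down_weight(freq, a=1e-3):
    """Compute weight(w) = a / (a + f_w) for a word with frequency `freq`.

    `a` is a smoothing constant; the default here is an assumed value
    chosen only for illustration.
    """
    if freq < 0:
        raise ValueError("frequency must be non-negative")
    return a / (a + freq)

# Hypothetical frequencies, from rare to very common words.
freqs = [1e-6, 1e-4, 1e-3, 1e-2, 1e-1]
weights = [down_weight(f) for f in freqs]
```

The weight stays close to 1 for rare words and falls towards 0 as the frequency grows, which is the shape that the right panel of the figure plots.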