Document Embedding with Paragraph Vectors

Andrew M. Dai, Christopher Olah, Quoc V. Le

TL;DR

Paragraph Vectors (PVs) are evaluated as dense document representations on tasks beyond sentiment analysis. The authors compare PVs to LDA, TF-IDF, and averaged word embeddings on Wikipedia and arXiv, varying the embedding dimensionality and using a triplet-based semantic similarity framework. They show that PVs outperform LDA on Wikipedia across dimensions and are competitive with LDA at its best topic counts on arXiv, and that jointly training word embeddings improves PV quality. They also demonstrate that simple vector operations on PVs yield semantically meaningful results, enabling local and nonlocal corpus navigation and new analysis techniques.
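The triplet-based evaluation mentioned above can be illustrated with a minimal sketch. It assumes each triplet is (a, b, c), where a and b are meant to be more similar to each other than a and c, and it assumes cosine similarity as the distance measure. The function names, document keys, and three-dimensional vectors are hypothetical toy values, not data or code from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def triplet_accuracy(embeddings, triplets):
    """Fraction of triplets (a, b, c) in which a is closer to b than to c."""
    correct = sum(
        1
        for a, b, c in triplets
        if cosine(embeddings[a], embeddings[b]) > cosine(embeddings[a], embeddings[c])
    )
    return correct / len(triplets)


# Hypothetical toy embeddings; real paragraph vectors would be learned.
embeddings = {
    "deep_learning": [0.9, 0.1, 0.0],
    "machine_learning": [0.8, 0.2, 0.1],
    "computer_network": [0.1, 0.9, 0.2],
    "search_engine": [0.2, 0.1, 0.9],
}

# In each triplet the first two documents are assumed to be the related pair.
triplets = [
    ("deep_learning", "machine_learning", "computer_network"),
    ("deep_learning", "machine_learning", "search_engine"),
]

accuracy = triplet_accuracy(embeddings, triplets)
```

Because the score depends only on the relative ordering of similarities, the same harness could in principle be applied to any representation that yields a vector per document, which is what would make a cross-method comparison at different dimensionalities straightforward.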

Abstract

Paragraph Vectors has recently been proposed as an unsupervised method for learning distributed representations of pieces of text. In their work, the authors showed that the method can learn an embedding of movie review texts that can be leveraged for sentiment analysis. That proof of concept, while encouraging, was rather narrow. Here we consider tasks other than sentiment analysis, provide a more thorough comparison of Paragraph Vectors to other document modelling algorithms such as Latent Dirichlet Allocation, and evaluate the performance of the method as we vary the dimensionality of the learned representation. We benchmarked the models on two document similarity data sets, one from Wikipedia and one from arXiv. We observe that the Paragraph Vector method performs significantly better than other methods, and we propose a simple improvement to enhance embedding quality. Somewhat surprisingly, we also show that, much like word embeddings, vector operations on Paragraph Vectors can produce useful semantic results.
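As a rough illustration of what a vector operation on Paragraph Vectors might look like, the sketch below subtracts one vector from a document vector, adds another, and ranks the remaining documents by cosine similarity to the result. It assumes document and word vectors live in one shared space, which is an assumption of this sketch. All names (`nearest_after_shift`, the `doc_*` and `word_*` keys) and the toy vectors are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def nearest_after_shift(vectors, start, minus, plus, top_k=1):
    """Rank entries by similarity to vectors[start] - vectors[minus] + vectors[plus]."""
    query = [
        s - m + p
        for s, m, p in zip(vectors[start], vectors[minus], vectors[plus])
    ]
    # Exclude the three inputs so the query does not trivially return them.
    candidates = [name for name in vectors if name not in (start, minus, plus)]
    candidates.sort(key=lambda name: cosine(query, vectors[name]), reverse=True)
    return candidates[:top_k]


# Hypothetical toy vectors: axis 0 ~ topic, axis 1 ~ attribute A, axis 2 ~ attribute B.
vectors = {
    "doc_music_a": [1.0, 1.0, 0.0],
    "doc_music_b": [1.0, 0.0, 1.0],
    "doc_physics_a": [-1.0, 1.0, 0.0],
    "word_a": [0.0, 1.0, 0.0],
    "word_b": [0.0, 0.0, 1.0],
}

result = nearest_after_shift(vectors, "doc_music_a", "word_a", "word_b")
```

In this toy setup, removing attribute A from the music document and adding attribute B lands on the music document that carries attribute B, which mirrors the kind of nonlocal navigation the summary describes.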

Paper Structure

This paper contains 6 sections, 6 figures, 8 tables.

Figures (6)

  • Figure 1: The distributed memory model of Paragraph Vector for an input sentence.
  • Figure 2: The distributed bag of words model of Paragraph Vector.
  • Figure 3: Visualization of Wikipedia paragraph vectors using t-SNE.
  • Figure 4: Results of experiments on the hand-built Wikipedia triplet dataset.
  • Figure 5: Results of experiments on the generated Wikipedia triplet dataset.
  • ...and 1 more figure