Top2Vec: Distributed Representations of Topics

Dimo Angelov

TL;DR

Top2Vec introduces a novel approach to topic modeling that leverages joint semantic embeddings of documents and words to discover topics as vectors in a continuous space. By training with doc2vec DBOW and word vectors, then identifying dense regions via UMAP and HDBSCAN, it derives topic centroids and representative words without stop-word removal or predefining topic counts. A topic information gain measure based on mutual information is proposed to evaluate topic quality, and empirical results on the 20 News Groups and Yahoo Answers datasets show Top2Vec yields more informative and representative topics than LDA/PLSA, with useful hierarchical topic reduction. The work demonstrates a practical, scalable framework for interpretable topic discovery in large corpora, with open-source code for broader adoption.

Abstract

Topic modeling is used for discovering latent semantic structure, usually referred to as topics, in a large collection of documents. The most widely used methods are Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis. Despite their popularity, they have several weaknesses. To achieve optimal results, they often require the number of topics to be known, custom stop-word lists, stemming, and lemmatization. Additionally, these methods rely on bag-of-words representations of documents, which ignore the ordering and semantics of words. Distributed representations of documents and words have gained popularity due to their ability to capture the semantics of words and documents. We present $\texttt{top2vec}$, which leverages joint document and word semantic embedding to find $\textit{topic vectors}$. This model does not require stop-word lists, stemming, or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors, with the distance between them representing semantic similarity. Our experiments demonstrate that $\texttt{top2vec}$ finds topics which are significantly more informative and representative of the corpus trained on than those found by probabilistic generative models.

Paper Structure

This paper contains 19 sections, 5 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: An example of a semantic space. The purple points are documents and the green points are words. Words are closest to documents they best represent and similar documents are close together.
  • Figure 2: 300-dimensional document vectors from the 20 news groups dataset that are embedded into 2 dimensions using UMAP.
  • Figure 3: UMAP-reduced document vectors from the 20 news groups dataset. Each colored area of points is a dense area of documents identified by HDBSCAN; red points are documents HDBSCAN has labeled as noise.
  • Figure 4: The topic vector is the centroid of the dense area of documents identified by HDBSCAN, which are the purple points. The outliers identified by HDBSCAN are not used to calculate the centroid.
  • Figure 5: The topic words are the nearest word vectors to the topic vector.
  • ...and 10 more figures
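
The captions for Figures 4 and 5 describe two steps that are easy to illustrate: a topic vector is the centroid of a dense cluster of document vectors (noise excluded), and the topic words are the word vectors nearest to it. The following is a minimal sketch of those two steps, not the reference implementation. It assumes the cluster labels are already available, that noise is marked with the label -1 (the convention HDBSCAN uses), and that nearness is measured by cosine similarity. The function names and the toy 2-dimensional vectors are hypothetical, chosen only for illustration.

```python
import numpy as np


def topic_vectors(doc_vecs, labels):
    """Return one centroid per cluster, skipping documents labeled -1 (noise)."""
    topic_ids = sorted(set(labels.tolist()) - {-1})
    return np.vstack([doc_vecs[labels == t].mean(axis=0) for t in topic_ids])


def nearest_words(topic_vec, word_vecs, vocab, k=2):
    """Return the k words whose vectors have the highest cosine similarity to topic_vec."""
    unit_words = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    unit_topic = topic_vec / np.linalg.norm(topic_vec)
    order = np.argsort(-(unit_words @ unit_topic))[:k]
    return [vocab[i] for i in order]


# Toy data (hypothetical): five documents, the last one treated as noise.
doc_vecs = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9], [-5.0, -5.0]])
labels = np.array([0, 0, 1, 1, -1])

# Toy word vectors assumed to live in the same space as the documents.
vocab = ["hockey", "goal", "orbit", "nasa"]
word_vecs = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]])

topics = topic_vectors(doc_vecs, labels)
for topic_id, topic_vec in enumerate(topics):
    print(topic_id, nearest_words(topic_vec, word_vecs, vocab))
```

Because the noise document at (-5, -5) is excluded, each centroid stays inside its own dense region; including it would pull a centroid away from the documents it is meant to summarize.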