CWTM: Leveraging Contextualized Word Embeddings from BERT for Neural Topic Modeling

Zheng Fang, Yulan He, Rob Procter

TL;DR

The paper addresses the limitations of bag-of-words (BOW) representations in topic modeling by introducing the Contextualized Word Topic Model (CWTM), which uses contextualized word embeddings from BERT to learn per-word topic vectors and a document-topic vector without BOW input. It builds on a Wasserstein autoencoder framework and combines mutual information maximization, masked language modeling, and distribution matching via Maximum Mean Discrepancy to regularize the word- and document-level topic distributions, with trainable soft prompts used to tune BERT. The contributions are a novel BOW-free topic model that handles unseen words, results showing superior topic coherence and diversity across five datasets, and evidence that the learned word-topic vectors improve downstream tasks such as NER. The work is relevant to context-aware topic modeling and to downstream NLP applications that need word-level topic semantics, particularly in settings with OOV words or evolving vocabularies.
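The distribution-matching idea mentioned above can be illustrated with a small, self-contained sketch of a Maximum Mean Discrepancy (MMD) estimate between a batch of topic vectors and samples from a Dirichlet prior. This is not the authors' implementation: the Gaussian kernel, its bandwidth, the Dirichlet concentration of 0.5, the batch size, and all function names are assumptions chosen for illustration only.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel; the kernel choice is an assumption of this sketch."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, kernel=rbf_kernel):
    """Biased estimate of the squared MMD between two sets of vectors."""
    def mean_kernel(a, b):
        return sum(kernel(p, q) for p in a for q in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


rng = random.Random(0)
num_topics = 5  # hypothetical number of topics

# Samples from the (assumed) Dirichlet prior over topic proportions.
prior_samples = [sample_dirichlet([0.5] * num_topics, rng) for _ in range(64)]

# A degenerate "encoder output" where every document sits on the first topic.
collapsed = [[1.0, 0.0, 0.0, 0.0, 0.0] for _ in range(64)]

penalty_same = mmd_squared(prior_samples, prior_samples)   # identical sets
penalty_collapsed = mmd_squared(prior_samples, collapsed)  # mismatched sets
```

In a training loop, a quantity like `penalty_collapsed` would be added to the loss so that the encoded topic vectors are pushed toward the prior; identical sample sets give a penalty of zero, while the collapsed batch gives a strictly positive one.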

Abstract

Most existing topic models rely on bag-of-words (BOW) representation, which limits their ability to capture word order information and leads to challenges with out-of-vocabulary (OOV) words in new documents. Contextualized word embeddings, however, show superiority in word sense disambiguation and effectively address the OOV issue. In this work, we introduce a novel neural topic model called the Contextualized Word Topic Model (CWTM), which integrates contextualized word embeddings from BERT. The model is capable of learning the topic vector of a document without BOW information. In addition, it can also derive the topic vectors for individual words within a document based on their contextualized word embeddings. Experiments across various datasets show that CWTM generates more coherent and meaningful topics compared to existing topic models, while also accommodating unseen words in newly encountered documents.

Paper Structure (17 sections, 7 equations, 3 figures, 12 tables)


Figures (3)

  • Figure 1: Model architecture. The contextualised word embeddings of the target document are encoded into word-topic vectors, which are combined by weighted average pooling into the document-topic vector; this vector is regularized to follow a Dirichlet distribution. The document-topic vector is then used to reconstruct the document embedding, which is learned with a mutual information maximisation objective. A masked language modeling objective is also added to regularize the word embeddings.
  • Figure 2: Topic coherence and topic diversity scores of the different models across different numbers of topics for 20NG (top row), TagMyNews (2nd row), Twitter (3rd row), DBpedia (4th row), and AGNews (bottom row).
  • Figure B1: Document classification accuracy of the different models across different numbers of topics for 20NG (top row, left), TagMyNews (top row, right), Twitter (2nd row, left), DBpedia (2nd row, right), and AGNews (bottom row).
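The pooling step described in the Figure 1 caption (word-topic vectors combined by a weighted average into a document-topic vector) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the softmax over per-word logits, the source of the per-word weights, and the names `softmax` and `pool_document_topics` are all hypothetical.

```python
import math


def softmax(logits):
    """Turn raw per-word topic scores into a probability vector."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pool_document_topics(word_topic_vectors, weights):
    """Weighted average of word-topic vectors, giving one document-topic vector."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    num_topics = len(word_topic_vectors[0])
    return [
        sum(w * vec[k] for vec, w in zip(word_topic_vectors, weights)) / total
        for k in range(num_topics)
    ]


# Hypothetical per-word topic logits for a three-token document and two topics.
word_logits = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_topics = [softmax(row) for row in word_logits]

# Hypothetical importance weights, one per token (how they are obtained is assumed).
token_weights = [1.0, 1.0, 2.0]
doc_topics = pool_document_topics(word_topics, token_weights)
```

Because each word-topic vector sums to one and the weights are normalised, the pooled document-topic vector also sums to one, which is what allows it to be compared against a Dirichlet prior.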