CWTM: Leveraging Contextualized Word Embeddings from BERT for Neural Topic Modeling
Zheng Fang, Yulan He, Rob Procter
TL;DR
The paper addresses the limitations of bag-of-words (BOW) representations in topic modeling by introducing the Contextualized Word Topic Model (CWTM), which uses contextualized word embeddings from BERT to learn a topic vector for each word and a document-topic vector, without any BOW input. The model is built on a Wasserstein autoencoder framework. It combines mutual information maximization, masked language modeling, and distribution matching via Maximum Mean Discrepancy to regularize the word-level and document-level topic distributions, and it uses trainable soft prompts to optimize BERT. The contributions are a novel BOW-free topic model that handles unseen words, results showing superior topic coherence and diversity across five datasets, and evidence that the learned word-topic vectors improve downstream tasks such as named entity recognition (NER). The work is relevant to context-aware topic modeling and to downstream NLP applications that need word-level topic semantics, particularly in settings with out-of-vocabulary (OOV) words or evolving vocabularies.
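To make the distribution-matching term concrete, the following is a minimal sketch of a Maximum Mean Discrepancy estimate between two sets of topic vectors. It is not the authors' implementation: the Gaussian RBF kernel, the bandwidth, the Dirichlet prior, and the names `rbf_kernel` and `mmd2` are all assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel matrix between rows of x and rows of y.

    The kernel choice and bandwidth are illustrative assumptions.
    """
    sq_dist = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Hypothetical setup: 20 topics, a sparse Dirichlet prior (assumed here),
# and two stand-ins for encoder outputs.
rng = np.random.default_rng(0)
num_topics = 20
prior = rng.dirichlet(np.full(num_topics, 0.1), size=256)
matched = rng.dirichlet(np.full(num_topics, 0.1), size=256)      # same distribution
mismatched = rng.dirichlet(np.full(num_topics, 10.0), size=256)  # near-uniform vectors

mmd_matched = mmd2(prior, matched)
mmd_mismatched = mmd2(prior, mismatched)
print(f"MMD^2 (matched):    {mmd_matched:.4f}")
print(f"MMD^2 (mismatched): {mmd_mismatched:.4f}")
```

In a regularizer of this kind, the estimate would be added to the training loss so that the encoder's topic vectors are pushed toward the prior: samples drawn from the same distribution give a value near zero, while topic vectors that differ from the prior give a larger value.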
Abstract
Most existing topic models rely on bag-of-words (BOW) representations, which limits their ability to capture word order information and leads to challenges with out-of-vocabulary (OOV) words in new documents. Contextualized word embeddings, however, show superiority in word sense disambiguation and effectively address the OOV issue. In this work, we introduce a novel neural topic model called the Contextualized Word Topic Model (CWTM), which integrates contextualized word embeddings from BERT. The model is capable of learning the topic vector of a document without BOW information. In addition, it can derive topic vectors for individual words within a document based on their contextualized word embeddings. Experiments across various datasets show that CWTM generates more coherent and meaningful topics than existing topic models, while also accommodating unseen words in newly encountered documents.
