Self-Supervised Learning for Neural Topic Models with Variance-Invariance-Covariance Regularization

Weiran Xu, Kengo Hirami, Koji Eguchi

TL;DR

The paper improves neural topic models (NTMs) by applying Variance-Invariance-Covariance Regularization (VICReg) to the latent topic space, yielding VICNTM and a Deep VICNTM variant. It introduces self-supervised anchor–positive regularization that prevents collapse without relying on negative samples, plus model-based adversarial data augmentation to generate positives. Experiments on three datasets show consistent gains in topic coherence (NPMI) and diversity (TD/IRBO), and in some cases perplexity, with ablations highlighting the importance of covariance regularization. The work combines self-supervised regularization with topic modeling, offering a practical approach to producing more coherent and diverse topics for large document collections.

Abstract

In our study, we propose a self-supervised neural topic model (NTM) that combines the power of NTMs and regularized self-supervised learning methods to improve performance. NTMs use neural networks to learn latent topics hidden behind the words in documents, enabling greater flexibility and the ability to estimate more coherent topics compared to traditional topic models. On the other hand, some self-supervised learning methods use a joint embedding architecture with two identical networks that produce similar representations for two augmented versions of the same input. Regularizations are applied to these representations to prevent collapse, which would otherwise result in the networks outputting constant or redundant representations for all inputs. Our model enhances topic quality by explicitly regularizing latent topic representations of anchor and positive samples. We also introduced an adversarial data augmentation method to replace the heuristic sampling method. We further developed several variation models including those on the basis of an NTM that incorporates contrastive learning with both positive and negative samples. Experimental results on three datasets showed that our models outperformed baselines and state-of-the-art models both quantitatively and qualitatively.
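
To make the three regularization terms concrete, the following is a minimal NumPy sketch of a VICReg-style loss applied to two batches of latent topic representations (anchor and positive). It follows the general VICReg formulation: an invariance term that pulls the two views together, a variance term that keeps each dimension's standard deviation above a threshold, and a covariance term that penalizes off-diagonal covariance. The function name `vicreg_terms`, the weights `sim_w`, `var_w`, `cov_w`, and the values of `gamma` and `eps` are illustrative assumptions, not settings taken from this paper, and the sketch omits the topic model's reconstruction and KL objectives.

```python
import numpy as np


def vicreg_terms(z, z_pos, gamma=1.0, eps=1e-4):
    """Return the (invariance, variance, covariance) terms for two views.

    z, z_pos: arrays of shape (n, d) holding anchor and positive
    representations. gamma and eps are assumed placeholder values.
    """
    n, d = z.shape

    # Invariance: mean squared Euclidean distance between paired rows.
    invariance = np.mean(np.sum((z - z_pos) ** 2, axis=1))

    def variance_term(x):
        # Hinge on the per-dimension standard deviation, averaged over dims.
        std = np.sqrt(x.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    def covariance_term(x):
        # Sum of squared off-diagonal covariance entries, scaled by 1/d.
        xc = x - x.mean(axis=0)
        cov = xc.T @ xc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z) + variance_term(z_pos)
    covariance = covariance_term(z) + covariance_term(z_pos)
    return invariance, variance, covariance


def vicreg_loss(z, z_pos, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """Weighted sum of the three terms; the weights are assumptions."""
    inv, var, cov = vicreg_terms(z, z_pos)
    return sim_w * inv + var_w * var + cov_w * cov


# Usage: a random batch of 8 documents with 4 latent topic dimensions,
# and a slightly perturbed copy standing in for the positive samples.
rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 4))
positive = anchor + 0.1 * rng.normal(size=(8, 4))
loss = vicreg_loss(anchor, positive)
```

In a fully collapsed batch, where every row is identical, the invariance and covariance terms are zero and only the variance hinge is active, which is why no negative samples are needed to penalize collapse.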

Paper Structure

This paper contains 21 sections, 6 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Illustration of VICNTM and Deep VICNTM. VICNTM: components connected by red solid arrows. Deep VICNTM: components connected by blue dotted arrows. Note that VICNTM performs regularization on the latent representations $\boldsymbol{Z}$ and $\boldsymbol{Z'}$, while Deep VICNTM performs regularization on the high-dimensional embeddings $\boldsymbol{Y}$ and $\boldsymbol{Y'}$.
  • Figure 2: Illustration of VICReg. VICReg performs regularization on the high-dimensional embeddings $\boldsymbol{Y}$ and $\boldsymbol{Y'}$.
  • Figure 3: Learning curves that confirm the three regularization losses were effectively learned
  • Figure 4: t-SNE plots of latent topic representations on the 20NG dataset
  • Figure 5: Results of NPMI when using different sampling strategies in our proposed models