
Towards Generalising Neural Topical Representations

Xiaohao Yang, He Zhao, Dinh Phung, Lan Du

TL;DR

This work enhances NTMs by narrowing the semantic distance between similar documents, under the assumption that documents from different corpora may share similar semantics. Doing so significantly improves the generalisation of neural topical representations across corpora.

Abstract

Topic models have evolved from conventional Bayesian probabilistic models to recent Neural Topic Models (NTMs). Although NTMs have shown promising performance when trained and tested on a specific corpus, their generalisation ability across corpora has yet to be studied. In practice, we often expect an NTM trained on a source corpus to still produce quality topical representations (i.e., latent distributions over topics) for documents from different target corpora, at least to a certain degree. In this work, we aim to improve NTMs further so that their representation power for documents generalises reliably across corpora and tasks. To do so, we propose to enhance NTMs by narrowing the semantic distance between similar documents, with the underlying assumption that documents from different corpora may share similar semantics. Specifically, we obtain a similar document for each training document by text data augmentation. Then, we optimise NTMs further by minimising the semantic distance between each pair, measured by the Topical Optimal Transport (TopicalOT) distance, which computes the optimal transport distance between their topical representations. Our framework can be readily applied to most NTMs as a plug-and-play module. Extensive experiments show that our framework significantly improves the generalisation ability of neural topical representations across corpora. Our code and datasets are available at: https://github.com/Xiaohao-Yang/Topic_Model_Generalisation.
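
The core quantity in the abstract, an optimal transport distance between the topical representations of a document and its augmentation, can be illustrated with a minimal sketch. It uses entropy-regularised OT solved by Sinkhorn iterations as a stand-in; the function name `sinkhorn_distance`, the hand-written 3-topic cost matrix, and the parameters `reg` and `n_iters` are all assumptions made for illustration and are not taken from the paper or its repository.

```python
import math

def sinkhorn_distance(a, b, cost, reg=0.1, n_iters=500):
    """Approximate OT cost between two topic distributions a and b.

    cost[i][j] is the (assumed) cost of moving mass from topic i to topic j.
    reg is the entropic regularisation strength (illustrative value).
    """
    k = len(a)
    # Gibbs kernel derived from the cost matrix.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(k)] for i in range(k)]
    u = [1.0] * k
    v = [1.0] * k
    for _ in range(n_iters):
        # Alternately rescale columns and rows to match the marginals b and a.
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(k)) for j in range(k)]
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(k)]
    # Transport plan P[i][j] = u[i] * kernel[i][j] * v[j]; return its total cost.
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(k) for j in range(k))

# Hypothetical cost matrix: every pair of distinct topics is equally far apart.
topic_cost = [[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]]

z_doc = [0.7, 0.2, 0.1]   # topical representation of a document (made up)
z_aug = [0.6, 0.3, 0.1]   # representation of its augmentation (made up)
z_far = [0.1, 0.2, 0.7]   # representation of an unrelated document (made up)

d_same = sinkhorn_distance(z_doc, z_doc, topic_cost)
d_near = sinkhorn_distance(z_doc, z_aug, topic_cost)
d_far = sinkhorn_distance(z_doc, z_far, topic_cost)
print(d_same, d_near, d_far)
```

In this toy setting the distance to the augmentation is small and the distance to the unrelated document is large. The regulariser described in the abstract pushes the first kind of distance down during training.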

Paper Structure

This paper contains 36 sections, 14 equations, 3 figures, 15 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Neural Topic Model (NTM) with Generalisation Regularisation (Greg). The BOW vectors of a document and its augmentation are each encoded as a topical representation. In addition to the usual VAE-NTM objectives of reconstructing the input ("Rec Loss") and matching the posterior distribution to the prior ("KL Loss"), we encourage the model to produce a similar $\bm{z}$ for the original document and its augmentation. The distance between the two $\bm{z}$ vectors is measured by TopicalOT as our "Greg Loss", which is guided by the document cost matrix whose entries specify the OT cost of moving between topics. Our framework can be readily applied to most NTMs as a plug-and-play module. Note: two encoders are drawn for a tidy illustration; they are identical.
  • Figure 2: Effect of the number of topics (i.e., $K$) on the backbones and Greg. The figures show the quality of the topical representations, under different metrics, for different numbers of topics.
  • Figure 3: Hyperparameter sensitivity of Greg. The x-axis of the first row is the regularisation weight $\gamma$; the x-axis of the second row is the augmentation rate $\beta$.
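
Read together, the captions of Figure 1 and Figure 3 suggest a training objective of the following shape. This is only a sketch: it assumes the three losses are combined additively and that $\gamma$ weights the Greg term alone. The symbols $\mathcal{L}_{\text{Rec}}$, $\mathcal{L}_{\text{KL}}$, $\bm{z}^{\text{aug}}$ and $d_{\text{TopicalOT}}$ are labels introduced here for illustration, not notation confirmed by the paper.

```latex
\mathcal{L}
  \;=\; \mathcal{L}_{\text{Rec}}
  \;+\; \mathcal{L}_{\text{KL}}
  \;+\; \gamma \, d_{\text{TopicalOT}}\!\left(\bm{z},\, \bm{z}^{\text{aug}}\right)
```

Under this reading, setting $\gamma = 0$ recovers the backbone VAE-NTM objective, which is consistent with the description of Greg as a plug-and-play module.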