InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling

Xiaobao Wu, Xinshuai Dong, Thong Nguyen, Chaoqun Liu, Liangming Pan, Anh Tuan Luu

TL;DR

This paper proposes InfoCTM (Cross-lingual Topic Modeling with Mutual Information), which aligns topics across languages by maximizing mutual information. The objective acts as a regularizer that aligns topics properly and prevents degenerate topic representations of words, which mitigates the repetitive topic issue.

Abstract

Cross-lingual topic models have been prevalent for cross-lingual text analysis by revealing aligned latent topics. However, most existing methods suffer from two problems: they produce repetitive topics that hinder further analysis, and their performance declines when the dictionary has low coverage. In this paper, we propose the Cross-lingual Topic Modeling with Mutual Information (InfoCTM). Instead of the direct alignment in previous work, we propose a topic alignment with mutual information method. This works as a regularization to properly align topics and prevent degenerate topic representations of words, which mitigates the repetitive topic issue. To address the low-coverage dictionary issue, we further propose a cross-lingual vocabulary linking method that finds more linked cross-lingual words for topic alignment beyond the translations of a given dictionary. Extensive experiments on English, Chinese, and Japanese datasets demonstrate that our method outperforms state-of-the-art baselines, producing more coherent, diverse, and well-aligned topics and showing better transferability for cross-lingual classification tasks.
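
To make the alignment idea concrete, here is a minimal sketch in Python. It assumes a contrastive, InfoNCE-style objective, which is a common surrogate for maximizing a lower bound on mutual information; the abstract does not state the exact loss, so the paper's formulation may differ. The function name alignment_loss, the temperature value, and the toy matrices are all hypothetical.

```python
import numpy as np


def alignment_loss(phi_a, phi_b, links, temperature=0.2):
    """InfoNCE-style loss over linked cross-lingual word pairs (illustrative).

    phi_a: (V_a, K) topic representations of language-A words.
    phi_b: (V_b, K) topic representations of language-B words.
    links: (V_a, V_b) 0/1 matrix; 1 marks a linked word pair.
    """
    phi_a = np.asarray(phi_a, dtype=float)
    phi_b = np.asarray(phi_b, dtype=float)
    links = np.asarray(links, dtype=float)

    # Cosine similarity between every language-A and language-B word.
    a = phi_a / np.maximum(np.linalg.norm(phi_a, axis=1, keepdims=True), 1e-12)
    b = phi_b / np.maximum(np.linalg.norm(phi_b, axis=1, keepdims=True), 1e-12)
    logits = (a @ b.T) / temperature

    # Row-wise log-softmax: each language-A word is scored against all
    # language-B words, so unlinked words act as negatives.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Negative mean log-probability of the linked pairs.
    return float(-(links * log_prob).sum() / links.sum())


# Toy example (assumed data): three words per language, word i linked to word i.
links = np.eye(3)
separated = np.eye(3)          # each word has its own topic
degenerate = np.ones((3, 3))   # every word has the same representation

loss_separated = alignment_loss(separated, separated, links)
loss_degenerate = alignment_loss(degenerate, degenerate, links)
```

In this toy setting the degenerate case, where every word shares one representation, gives a loss of log 3, while the well-separated case gives a much smaller value. That gap is the sense in which a term of this kind can act as a regularizer against degenerate topic representations.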

Paper Structure (29 sections, 8 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Illustration of cross-lingual topic models, producing aligned topics of different languages. Words in the brackets are the corresponding English translations.
  • Figure 2: Cosine distance between the topic representations of words over the course of training. The results show that while the topic representations degenerate into similar values in NMTM (Wu et al., 2020), our InfoCTM successfully avoids degenerate topic representations.
  • Figure 3: Illustration of InfoCTM. The generation of cross-lingual documents follows a variational autoencoder (VAE). The proposed topic alignment with mutual information method aligns the topic representations of linked words ("歌曲" (song) and "song" or "album") and also keeps the distance between the topic representations of unlinked words ("歌曲" (song) and "oil" or "chelsea") to avoid degenerate topic representations.
  • Figure 4: Illustration of Cross-lingual Vocabulary Linking.
  • Figure 5: Document classification accuracy, where "-i" denotes intra-lingual classification and "-c" denotes cross-lingual classification. The languages involved are English (en), Chinese (zh), and Japanese (ja). The improvements of InfoCTM on cross-lingual classification (en-c, zh-c, ja-c) are statistically significant at the 0.05 level.
  • ...and 1 more figure
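
The degeneracy diagnostic described in the Figure 2 caption can be sketched as follows. This is one plausible way to compute such a curve, assuming the quantity tracked is the mean pairwise cosine distance between word topic representations; the caption does not specify the exact averaging, and the function name mean_pairwise_cosine_distance is hypothetical.

```python
import numpy as np


def mean_pairwise_cosine_distance(phi):
    """Average cosine distance over all distinct pairs of rows in phi.

    phi: (V, K) matrix with one topic representation per word.
    Values near 0 suggest the representations have collapsed together.
    """
    phi = np.asarray(phi, dtype=float)
    norms = np.linalg.norm(phi, axis=1, keepdims=True)
    unit = phi / np.maximum(norms, 1e-12)   # guard against zero rows
    cos = unit @ unit.T                     # (V, V) cosine similarities
    upper = np.triu_indices(phi.shape[0], k=1)  # each pair counted once
    return float(np.mean(1.0 - cos[upper]))


# Toy check (assumed data): collapsed vs. well-separated representations.
collapsed = mean_pairwise_cosine_distance(np.ones((4, 3)))
separated = mean_pairwise_cosine_distance(np.eye(3))
```

Logging this value at regular training steps would give a curve of the kind the caption describes: a model whose word representations degenerate into similar values trends toward 0, while a model that keeps them apart stays well above it.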