InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling

Xiaobao Wu; Xinshuai Dong; Thong Nguyen; Chaoqun Liu; Liangming Pan; Anh Tuan Luu

TL;DR

This paper proposes InfoCTM (Cross-lingual Topic Modeling with Mutual Information). Its topic alignment method, based on mutual information, acts as a regularizer that aligns topics across languages while preventing degenerate topic representations of words, which mitigates the repetitive topic issue.

Abstract

Cross-lingual topic models have been prevalent for cross-lingual text analysis by revealing aligned latent topics. However, most existing methods suffer from two problems: they produce repetitive topics that hinder further analysis, and their performance declines when the dictionary has low coverage. In this paper, we propose Cross-lingual Topic Modeling with Mutual Information (InfoCTM). Instead of the direct alignment used in previous work, we propose a method for topic alignment with mutual information. It works as a regularizer that properly aligns topics and prevents degenerate topic representations of words, which mitigates the repetitive topic issue. To address the low-coverage dictionary issue, we further propose a cross-lingual vocabulary linking method that finds more linked cross-lingual words for topic alignment beyond the translations of a given dictionary. Extensive experiments on English, Chinese, and Japanese datasets demonstrate that our method outperforms state-of-the-art baselines, producing more coherent, diverse, and well-aligned topics and showing better transferability for cross-lingual classification tasks.
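To make the regularization idea concrete, the sketch below uses an InfoNCE-style contrastive loss, a common lower bound used for mutual information maximization. It is only an illustration under stated assumptions: this summary does not give the paper's exact objective, so the cosine similarity, the temperature value, the function names `cosine` and `info_nce_loss`, and the toy three-topic vectors are all hypothetical. The point it shows is that pulling linked words together while contrasting them against unlinked words penalizes collapsed (degenerate) topic representations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, candidates, positive_idx, temperature=0.2):
    """InfoNCE-style loss for one anchor word (assumed form, not the paper's).

    anchor: topic representation of a word in one language.
    candidates: topic representations of words in the other language.
    positive_idx: indices of the candidates that are linked to the anchor.
    """
    scores = [math.exp(cosine(anchor, c) / temperature) for c in candidates]
    denom = sum(scores)
    # Average negative log-probability of selecting each linked word.
    log_probs = [math.log(scores[i] / denom) for i in positive_idx]
    return -sum(log_probs) / len(positive_idx)


# Toy topic representations over 3 topics (hypothetical values).
zh_song = [0.8, 0.1, 0.1]      # "歌曲" (song)
en_words = [
    [0.7, 0.2, 0.1],           # "song"    (linked)
    [0.6, 0.3, 0.1],           # "album"   (linked)
    [0.1, 0.1, 0.8],           # "oil"     (unlinked)
    [0.1, 0.8, 0.1],           # "chelsea" (unlinked)
]

aligned_loss = info_nce_loss(zh_song, en_words, positive_idx=[0, 1])

# Degenerate case: every word has the same topic representation.
collapsed_loss = info_nce_loss(zh_song, [zh_song] * 4, positive_idx=[0, 1])

print(aligned_loss < collapsed_loss)
```

When all representations collapse to the same vector, every candidate is equally likely and the loss equals log(4) for four candidates, so the collapsed configuration scores no better than chance. Well-separated representations achieve a lower loss in this toy setup.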

Paper Structure (29 sections, 8 equations, 6 figures, 4 tables)

Introduction
Related Work
Cross-lingual Topic Models
Mutual Information Maximization
Methodology
Problem Setting and Notations
Aligning Topics across Languages By Maximizing Mutual Information
What Causes Repetitive Topics?
Topic Alignment with Mutual Information
Cross-lingual Vocabulary Linking
Objective Function of Topic Alignment with Mutual Information
Cross-lingual Topic Modeling with Mutual Information
Generation of Cross-lingual Documents
Objective Function for Generation of Topic Modeling
Overall Objective Function for InfoCTM
...and 14 more sections

Figures (6)

Figure 1: Illustration of cross-lingual topic models, producing aligned topics of different languages. Words in the brackets are the corresponding English translations.
Figure 2: Cosine distance between the topic representations of words over the course of training. The results show that while the topic representations degenerate into similar values in NMTM (Wu et al., 2020), our InfoCTM successfully avoids degenerate topic representations.
Figure 3: Illustration of InfoCTM. The generation of cross-lingual documents follows VAE. The proposed topic alignment with mutual information method aligns the topic representations of linked words ("歌曲"(song) and "song" or "album") and also keeps the distance between the topic representations of unlinked words ("歌曲"(song) and "oil" or "chelsea") to avoid degenerate topic representations.
Figure 4: Illustration of Cross-lingual Vocabulary Linking.
Figure 5: Document classification accuracy, where "-i" means intra-lingual classification and "-c" means cross-lingual classification. The languages involved are English (en), Chinese (zh), and Japanese (ja). The improvements of InfoCTM on cross-lingual classification (en-c, zh-c, ja-c) are statistically significant at the 0.05 level.
...and 1 more figures
