XTRA: Cross-Lingual Topic Modeling with Topic and Representation Alignments

Tien Phat Nguyen; Vu Minh Ngo; Tung Nguyen; Linh Van Ngo; Duc Anh Nguyen; Sang Dinh; Trung Le

TL;DR

XTRA addresses cross-lingual topic modeling by unifying Bag-of-Words representations with multilingual embeddings in a dual-contrastive framework. It jointly aligns document-topic proportions $\theta$ across languages through clustering-based contrastive learning and aligns topic-word distributions $\beta$ across languages by projecting them into a shared semantic space and applying InfoNCE. The model uses a shared encoder with language-specific projections and a VAE foundation to learn coherent and diverse topics, achieving high cross-lingual alignment and robust downstream performance. Across EC News, Amazon Review, and Rakuten Amazon, XTRA outperforms baselines on topic coherence (CNPMI), diversity (TU), and alignment (TQ), while improving cross-lingual classification, demonstrating practical impact for multilingual analysis. Code and reproducible scripts are available at the referenced GitHub repository.

Abstract

Cross-lingual topic modeling aims to uncover shared semantic themes across languages. Several methods have been proposed to address this problem, leveraging both traditional and neural approaches. While previous methods have achieved some improvements in topic diversity, they often struggle to ensure high topic coherence and consistent alignment across languages. We propose XTRA (Cross-Lingual Topic Modeling with Topic and Representation Alignments), a novel framework that unifies Bag-of-Words modeling with multilingual embeddings. XTRA introduces two core components: (1) representation alignment, which aligns document-topic distributions via contrastive learning in a shared semantic space; and (2) topic alignment, which projects topic-word distributions into the same space to enforce cross-lingual consistency. This dual mechanism enables XTRA to learn topics that are interpretable (coherent and diverse) and well-aligned across languages. Experiments on multilingual corpora confirm that XTRA significantly outperforms strong baselines in topic coherence, diversity, and alignment quality. Code and reproducible scripts are available at https://github.com/tienphat140205/XTRA.
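To make the topic-alignment idea concrete, the following is a minimal NumPy sketch of an InfoNCE loss applied to topic vectors from two languages. It is an illustration under stated assumptions, not the authors' implementation: the probability-weighted average used to project topic-word distributions into the embedding space, the function names `topic_embeddings` and `info_nce`, the temperature value, and the random toy data are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def topic_embeddings(beta, word_emb):
    """Map topic-word distributions (K, V) to topic vectors (K, D).

    Assumption for this sketch: each topic vector is the probability-weighted
    average of its language's word embeddings.
    """
    return beta @ word_emb


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over K pairs.

    Row i of `positives` is the positive for row i of `anchors`;
    every other row acts as a negative.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                       # (K, K) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Toy data: K topics, two vocabularies of different sizes, shared dimension D.
rng = np.random.default_rng(0)
K, V_a, V_b, D = 4, 30, 25, 8
beta_a = rng.dirichlet(np.ones(V_a), size=K)  # topic-word rows, language A
beta_b = rng.dirichlet(np.ones(V_b), size=K)  # topic-word rows, language B
emb_a = rng.normal(size=(V_a, D))             # stand-in multilingual embeddings
emb_b = rng.normal(size=(V_b, D))

topics_a = topic_embeddings(beta_a, emb_a)
topics_b = topic_embeddings(beta_b, emb_b)
loss = info_nce(topics_a, topics_b)
print(f"toy topic-alignment loss: {loss:.4f}")
```

Minimizing such a loss pulls topic k in one language toward topic k in the other while pushing it away from the remaining topics, which is the general behaviour a contrastive alignment term is meant to encourage.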
