XTRA: Cross-Lingual Topic Modeling with Topic and Representation Alignments

Tien Phat Nguyen, Vu Minh Ngo, Tung Nguyen, Linh Van Ngo, Duc Anh Nguyen, Sang Dinh, Trung Le

TL;DR

XTRA addresses cross-lingual topic modeling by unifying Bag-of-Words representations with multilingual embeddings in a dual-contrastive framework. It jointly aligns document-topic proportions $\theta$ across languages through clustering-based contrastive learning and aligns topic-word distributions $\beta$ across languages by projecting them into a shared semantic space and applying InfoNCE. The model uses a shared encoder with language-specific projections and a VAE foundation to learn coherent and diverse topics, achieving high cross-lingual alignment and robust downstream performance. Across EC News, Amazon Review, and Rakuten Amazon, XTRA outperforms baselines on topic coherence (CNPMI), diversity (TU), and alignment (TQ), while improving cross-lingual classification, demonstrating practical impact for multilingual analysis. Code and reproducible scripts are available at the referenced GitHub repository.
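The topic alignment step described above (project $\beta$ into a shared space, then apply InfoNCE) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the projection `beta @ word_emb`, the one-to-one pairing of topic k across languages, the temperature value, and all names (`info_nce`, `topic_embeddings`, the random toy data) are assumptions made for illustration.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over paired rows: row i of `anchors` matches row i of `positives`."""
    # L2-normalise so the dot product is a cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # (K, K) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i sits on the diagonal; all other columns are negatives.
    return -np.mean(np.diag(log_prob))

def topic_embeddings(beta, word_emb):
    """Assumed projection: each topic becomes a beta-weighted average of word vectors."""
    # beta: (K, V), rows sum to 1; word_emb: (V, D) multilingual word vectors.
    return beta @ word_emb

# Toy data (hypothetical): 4 topics, two vocabularies, 8-dimensional embeddings.
rng = np.random.default_rng(0)
K, V_en, V_zh, D = 4, 30, 25, 8
beta_en = rng.dirichlet(np.ones(V_en), size=K)
beta_zh = rng.dirichlet(np.ones(V_zh), size=K)
emb_en = rng.normal(size=(V_en, D))
emb_zh = rng.normal(size=(V_zh, D))

t_en = topic_embeddings(beta_en, emb_en)
t_zh = topic_embeddings(beta_zh, emb_zh)
loss = info_nce(t_en, t_zh)  # lower when topic k in English is closest to topic k in Chinese
```

Under these assumptions, minimizing `loss` pulls the embedding of topic k in one language toward the embedding of topic k in the other and pushes it away from the remaining topics.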

Abstract

Cross-lingual topic modeling aims to uncover shared semantic themes across languages. Several methods have been proposed to address this problem, leveraging both traditional and neural approaches. While previous methods have achieved some improvements in topic diversity, they often struggle to ensure high topic coherence and consistent alignment across languages. We propose XTRA (Cross-Lingual Topic Modeling with Topic and Representation Alignments), a novel framework that unifies Bag-of-Words modeling with multilingual embeddings. XTRA introduces two core components: (1) representation alignment, aligning document-topic distributions via contrastive learning in a shared semantic space; and (2) topic alignment, projecting topic-word distributions into the same space to enforce cross-lingual consistency. This dual mechanism enables XTRA to learn topics that are interpretable (coherent and diverse) and well-aligned across languages. Experiments on multilingual corpora confirm that XTRA significantly outperforms strong baselines in topic coherence, diversity, and alignment quality. Code and reproducible scripts are available at https://github.com/tienphat140205/XTRA.

Paper Structure

This paper contains 20 sections, 10 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: An example of cross-lingual topic alignment: both English and Chinese word clusters describe the shared theme of music.
  • Figure 2:
  • Figure 3: Our clustering-based contrastive alignment illustration. We group similar documents across languages into clusters using multilingual embeddings. Each document is aligned with its cluster via contrastive learning on topic distributions ($\theta$), encouraging cross-lingual consistency in the topic space.
  • Figure 4: Classification results on different datasets, shown in three panels.
  • Figure 5: LLM-based topic quality evaluations on the Amazon Review dataset. Darker shades indicate higher scores. The final reported score for each evaluation is a real number, representing the average of three independent LLM assessments.
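
The clustering-based contrastive alignment summarized in the Figure 3 caption can be sketched roughly as follows. This is one possible reading, not the paper's method: the simple k-means with farthest-point initialization, the use of mean-$\theta$ cluster prototypes as contrastive targets, the temperature, and all names (`kmeans`, `cluster_contrastive_loss`, the toy data) are assumptions introduced for illustration.

```python
import numpy as np

def kmeans(x, n_clusters, n_iter=20):
    """Plain k-means with deterministic farthest-point initialisation (an assumption)."""
    centers = [x[0]]
    for _ in range(1, n_clusters):
        # Squared distance from every point to its nearest chosen center.
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.stack(centers)
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = x[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels

def cluster_contrastive_loss(theta, labels, temperature=0.5):
    """Pull each document's theta toward its cluster prototype, away from the others."""
    clusters = np.unique(labels)
    # Assumed prototype: mean topic distribution of the documents in a cluster.
    prototypes = np.stack([theta[labels == c].mean(axis=0) for c in clusters])
    t = theta / np.linalg.norm(theta, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = t @ p.T / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    target = np.searchsorted(clusters, labels)           # column of each doc's own cluster
    return -log_prob[np.arange(len(theta)), target].mean()

# Toy data (hypothetical): pooled multilingual document embeddings forming two themes.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 0.1, size=(6, 4)),
                 rng.normal(5.0, 0.1, size=(6, 4))])
labels = kmeans(emb, n_clusters=2)
theta = rng.dirichlet(np.ones(5), size=12)               # stand-in document-topic proportions
loss = cluster_contrastive_loss(theta, labels)
```

In this sketch, documents from different languages that land in the same embedding cluster share a prototype, so lowering the loss encourages their topic proportions $\theta$ to agree.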