Refining Dimensions for Improving Clustering-based Cross-lingual Topic Models

Chia-Hsuan Chang, Tien-Yuan Huang, Yi-Hang Tsai, Chia-Ming Chang, San-Yih Hwang

TL;DR

This paper tackles cross-lingual topic identification by diagnosing language-dependent dimensions in multilingual language model embeddings that bias clustering-based topic models toward language rather than semantics. It introduces two SVD-based refinements, u-SVD and SVD-LR, both built on the decomposition $E = U\Sigma V^T$ of the embedding matrix $E$, to produce refined representations $E^*$ that suppress language signals before clustering. Across the Airiti Thesis, ECNews, and Rakuten Amazon datasets, with mBERT, Distilled XLM-R, or Cohere embeddings, the updated pipeline generally yields higher CNPMI and Topic Quality than baselines and competing CLTMs, and it is robust to embedding size and language pair. The approach offers a resource-efficient way to improve cross-lingual topic coherence and semantic alignment, and may apply to other multilingual clustering-based NLP tasks.

Abstract

Recent works in clustering-based topic models perform well in monolingual topic identification by introducing a pipeline to cluster the contextualized representations. However, the pipeline is suboptimal in identifying topics across languages due to the presence of language-dependent dimensions (LDDs) generated by multilingual language models. To address this issue, we introduce a novel, SVD-based dimension refinement component into the pipeline of the clustering-based topic model. This component effectively neutralizes the negative impact of LDDs, enabling the model to accurately identify topics across languages. Our experiments on three datasets demonstrate that the updated pipeline with the dimension refinement component generally outperforms other state-of-the-art cross-lingual topic models.
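To make the SVD step concrete, the following is a minimal sketch of reducing an embedding matrix with the decomposition $E = U\Sigma V^T$. The summary does not define u-SVD precisely, so reading it as "keep the left singular vectors without scaling them by the singular values" is an assumption, as are the function name `svd_reduce`, its `scale` flag, and the random toy matrix standing in for real document embeddings.

```python
import numpy as np


def svd_reduce(E, k, scale=True):
    """Reduce an (n_docs, dim) embedding matrix E to k dimensions via SVD.

    scale=True  -> U_k * Sigma_k (the usual SVD projection of E).
    scale=False -> U_k only (assumed here to correspond to an unscaled SVD).
    """
    # Thin SVD: E = U @ diag(S) @ Vt, singular values in descending order.
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    U_k = U[:, :k]
    return U_k * S[:k] if scale else U_k


# Toy stand-in for document embeddings (random values, not real model output).
rng = np.random.default_rng(0)
E = rng.normal(size=(50, 16))

E_svd = svd_reduce(E, k=8, scale=True)        # columns weighted by singular values
E_unscaled = svd_reduce(E, k=8, scale=False)  # every column has unit norm
```

In the unscaled variant each retained dimension contributes equally to distances, so a single high-variance direction cannot dominate the clustering step that follows. Whether this matches the paper's exact construction of $E^*$ should be checked against the full text.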

Paper Structure

This paper contains 19 sections, 3 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Two possible outcomes of a clustering-based topic model. Different shapes indicate documents discussing different topics, while different colors represent documents in different languages.
  • Figure 2: Top 3 language-dependent dimensions, sorted by t-statistic values, for original embeddings and embeddings reduced using UMAP, SVD and u-SVD. We utilize the Cohere multilingual model (see the section on multilingual language models) to encode the documents in one of our experimental datasets, namely ECNews. The value distributions for Chinese (cn) and English (en) documents are indicated by red and blue, respectively. UMAP, SVD, and u-SVD each reduce the dimension size of the original representations from 768 to 100. The appendix presents the same analysis for the other dataset, namely Rakuten Amazon.
  • Figure 3: Sensitivity analysis of u-SVD and SVD-LR on different dimensions.
  • Figure 4: Top 3 language-dependent dimensions, sorted by t-statistic values, for original embeddings and embeddings reduced using UMAP, SVD and u-SVD on the Rakuten Amazon dataset.
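
The captions for Figures 2 and 4 rank dimensions by t-statistic between the two languages. A minimal sketch of that ranking is shown below, assuming Welch's two-sample t-test (the captions do not say which variant is used). The function names, the synthetic data, and the final step of dropping the flagged dimension are illustrative only and are not claimed to reproduce the paper's SVD-LR procedure.

```python
import numpy as np


def welch_t_per_dimension(X, langs, lang_a, lang_b):
    """Welch's t-statistic per embedding dimension between two language groups."""
    A = X[langs == lang_a]
    B = X[langs == lang_b]
    mean_diff = A.mean(axis=0) - B.mean(axis=0)
    # Standard error from the unbiased per-group sample variances.
    se = np.sqrt(A.var(axis=0, ddof=1) / len(A) + B.var(axis=0, ddof=1) / len(B))
    return mean_diff / se


def top_language_dimensions(X, langs, lang_a, lang_b, k=3):
    """Indices of the k dimensions with the largest |t|, most separated first."""
    t = welch_t_per_dimension(X, langs, lang_a, lang_b)
    return np.argsort(-np.abs(t))[:k]


# Synthetic data: dimension 3 is shifted for "cn" documents only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
langs = np.array(["en"] * 100 + ["cn"] * 100)
X[langs == "cn", 3] += 5.0

ldd = top_language_dimensions(X, langs, "en", "cn", k=1)
X_without_ldd = np.delete(X, ldd, axis=1)  # drop the flagged dimension(s)
```

A dimension with a large absolute t-statistic has clearly different value distributions for the two languages, which is the pattern the histograms in these figures visualize.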