Discovering Low-rank Subspaces for Language-agnostic Multilingual Representations
Zhihui Xie, Handong Zhao, Tong Yu, Shuai Li
TL;DR
This paper tackles the problem that multilingual language models harbor language-specific signals that hinder truly language-agnostic semantics. It introduces LSAR, a simple unsupervised approach that uses singular value decomposition to identify a low-rank subspace capturing language-specific factors across languages, and then projects embeddings into the subspace's null space to obtain language-agnostic representations without finetuning. Empirical results across sentence retrieval, cross-lingual QA retrieval, and zero-shot classification demonstrate consistent improvements over strong baselines for multiple pretrained models, with notable gains on challenging benchmarks like LAReQA. Analyses reveal that the identified subspace predominantly encodes syntactic information and clusters by language families, offering a principled mechanism for reducing linguistic signals that do not contribute to semantics.
Abstract
Large pretrained multilingual language models (ML-LMs) have shown remarkable zero-shot cross-lingual transfer capabilities without direct cross-lingual supervision. While these results are promising, follow-up work has found that multilingual embedding spaces contain strong language identity information, which hinders the expression of linguistic factors shared across languages. For semantic tasks such as cross-lingual sentence retrieval, it is desirable to remove such language identity signals to fully leverage semantic information. In this work, we provide a novel view of projecting away language-specific factors from a multilingual embedding space. Specifically, we discover that there exists a low-rank subspace that primarily encodes information irrelevant to semantics (e.g., syntactic information). To identify this subspace, we present a simple but effective unsupervised method based on singular value decomposition with multiple monolingual corpora as input. Once the subspace is found, we can directly project the original embeddings into its null space to boost language agnosticism without finetuning. We systematically evaluate our method on various tasks, including the challenging language-agnostic QA retrieval task. Empirical results show that applying our method consistently leads to improvements over commonly used ML-LMs.
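The idea of finding a low-rank language-specific subspace with SVD and then projecting embeddings into its null space can be illustrated with a minimal NumPy sketch. This is a simplified illustration under stated assumptions, not the exact LSAR procedure: it assumes the subspace can be approximated by the top singular vectors of the per-language mean embeddings after removing their shared average, and it uses synthetic data in place of real ML-LM embeddings. The function names `language_subspace` and `remove_language_subspace` are hypothetical.

```python
import numpy as np

def language_subspace(embeddings_by_lang, rank):
    """Estimate an orthonormal basis (d x rank) for language-specific directions.

    Assumption (simplified, not necessarily the paper's exact algorithm):
    the subspace is spanned by the top left singular vectors of the
    per-language mean embeddings, centered across languages.
    """
    # Stack per-language mean embeddings as columns: shape (d, L).
    means = np.stack([e.mean(axis=0) for e in embeddings_by_lang], axis=1)
    # Subtract the average over languages, leaving language-specific offsets.
    centered = means - means.mean(axis=1, keepdims=True)
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :rank]

def remove_language_subspace(x, basis):
    """Project rows of x (n x d) onto the orthogonal complement of span(basis)."""
    return x - (x @ basis) @ basis.T

# Synthetic stand-in for monolingual corpora: shared noise plus a language offset.
rng = np.random.default_rng(0)
d, n = 16, 200
offsets = [3.0 * rng.normal(size=d) for _ in range(3)]
embeddings_by_lang = [rng.normal(size=(n, d)) + off for off in offsets]

basis = language_subspace(embeddings_by_lang, rank=2)
projected = [remove_language_subspace(e, basis) for e in embeddings_by_lang]

# Distance between the mean embeddings of two languages, before and after.
gap_before = np.linalg.norm(
    embeddings_by_lang[0].mean(axis=0) - embeddings_by_lang[1].mean(axis=0)
)
gap_after = np.linalg.norm(projected[0].mean(axis=0) - projected[1].mean(axis=0))
print(f"mean gap before: {gap_before:.3f}, after: {gap_after:.2e}")
```

In this toy setup the gap between language means collapses after projection because, with three languages, the centered means span at most two dimensions. With real embeddings the rank is a hyperparameter, and the projection requires no finetuning of the underlying model, only a matrix multiplication applied to the frozen embeddings.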
