Discovering Low-rank Subspaces for Language-agnostic Multilingual Representations

Zhihui Xie, Handong Zhao, Tong Yu, Shuai Li

TL;DR

This paper tackles the problem that multilingual language models harbor language-specific signals that hinder truly language-agnostic semantics. It introduces LSAR, a simple unsupervised approach that uses singular value decomposition to identify a low-rank subspace capturing language-specific factors across languages, and then projects embeddings into the subspace's null space to obtain language-agnostic representations without finetuning. Empirical results across sentence retrieval, cross-lingual QA retrieval, and zero-shot classification demonstrate consistent improvements over strong baselines for multiple pretrained models, with notable gains on challenging benchmarks like LAReQA. Analyses reveal that the identified subspace predominantly encodes syntactic information and clusters by language families, offering a principled mechanism for reducing linguistic signals that do not contribute to semantics.

Abstract

Large pretrained multilingual language models (ML-LMs) have shown remarkable zero-shot cross-lingual transfer capabilities without direct cross-lingual supervision. While these results are promising, follow-up work found that multilingual embedding spaces carry strong language identity information, which hinders the expression of linguistic factors shared across languages. For semantic tasks such as cross-lingual sentence retrieval, it is desirable to remove these language identity signals so that semantic information can be fully exploited. In this work, we provide a novel view of projecting away language-specific factors from a multilingual embedding space. Specifically, we discover that there exists a low-rank subspace that primarily encodes information irrelevant to semantics (e.g., syntactic information). To identify this subspace, we present a simple but effective unsupervised method based on singular value decomposition that takes multiple monolingual corpora as input. Once the subspace is found, we can directly project the original embeddings into its null space to boost language agnosticism without finetuning. We systematically evaluate our method on various tasks, including the challenging language-agnostic QA retrieval task. Empirical results show that applying our method consistently leads to improvements over commonly used ML-LMs.
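The two steps described in the abstract, identifying a low-rank subspace with SVD and then projecting embeddings onto its orthogonal complement, can be illustrated with a minimal NumPy sketch. This is a simplified illustration and not the paper's Algorithm: it assumes the shared component is the plain average of the per-language mean embeddings, and the function names, the synthetic data, and the choice of rank are all hypothetical.

```python
import numpy as np

def identify_language_subspace(lang_embeddings, rank):
    """Return an orthonormal basis (d x rank) for a language-specific subspace.

    lang_embeddings: list of arrays of shape (n_i, d), one per language.
    Simplifying assumption: the shared component is the plain average of
    the per-language means (the paper's procedure may differ).
    """
    # Column l of M is the mean embedding of language l (shape d x L).
    M = np.stack([E.mean(axis=0) for E in lang_embeddings], axis=1)
    # Subtract the component shared by all languages.
    mu = M.mean(axis=1, keepdims=True)
    # Left singular vectors of the centered means span the directions
    # along which languages differ from one another.
    U, _, _ = np.linalg.svd(M - mu, full_matrices=False)
    return U[:, :rank]

def remove_language_components(E, basis):
    """Project rows of E onto the orthogonal complement of span(basis)."""
    return E - (E @ basis) @ basis.T

# Synthetic demo: shared "semantic" vectors plus a per-language offset.
rng = np.random.default_rng(0)
d, L, n = 16, 4, 50
semantics = rng.normal(size=(n, d))
offsets = 3.0 * rng.normal(size=(L, d))
lang_embeddings = [
    semantics + offsets[l] + 0.1 * rng.normal(size=(n, d)) for l in range(L)
]

# Centered means of L languages have rank at most L - 1.
basis = identify_language_subspace(lang_embeddings, rank=L - 1)
cleaned = [remove_language_components(E, basis) for E in lang_embeddings]
```

In this toy setup the language offsets are removed because the centered language means lie entirely in the span of the top `L - 1` singular vectors. For real models the rank is a hyperparameter, and no finetuning is involved: the projection is applied to frozen embeddings.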

Paper Structure

This paper contains 34 sections, 1 theorem, 5 equations, 7 figures, 20 tables, 1 algorithm.

Key Result

Theorem 1

For any matrix $\boldsymbol{M} \in \mathbb{R}^{d \times L}$, Algorithm 1 returns $\boldsymbol{\mu} \in \mathbb{R}^{d}, \boldsymbol{M}_{s} \in \mathbb{R}^{d \times r}, \boldsymbol{\Gamma} \in \mathbb{R}^{L \times r}$ that minimize Equation eq:objective, where $\boldsymbol{\mu} \perp \text{Span}\,\dots$

Figures (7)

  • Figure 1: Conceptual illustration of our alignment method LSAR. The original pretrained multilingual representations carry strong language identity information. By projecting away language-specific components that reside in a low-rank subspace discovered in the identification process (top right), we can produce a language-agnostic embedding space via language agnosticism rectification (bottom). The probing procedure (colored in blue-grey) and the inference procedure (colored in yellow) can be done separately.
  • Figure 2: 2D PCA visualization on LAReQA. We display the embeddings collected from mBERT (X-X) on the XQuAD-R sub-dataset. Embeddings of the candidate answers (C) in English, Thai, and Mandarin are shown as small markers. Embeddings of the question (Q) in English and the ground-truth answers (A) in English, Thai, and Mandarin are shown as large markers. Higher opacity indicates higher predicted ranking.
  • Figure 3: Answer retrieval mAP on XQuAD-R broken down by question language (row) and answer language (column), with model mBERT (X-X). Only one correct answer is included in the multilingual candidate pool.
  • Figure 4: Removed components along the top two basis vectors of the identified low-rank subspace on mBERT.
  • Figure 5: Language similarity obtained from syntactic signals vs. language similarity measured by language-specific $\boldsymbol{s}_L$ of mBERT. Each point is a language.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Theorem 1
  • proof