Gromov-Wasserstein Alignment of Word Embedding Spaces

David Alvarez-Melis, Tommi S. Jaakkola

TL;DR

This work reframes cross-lingual word embedding alignment as a Gromov-Wasserstein optimal transport problem, leveraging relational similarities rather than absolute vector positions to learn language mappings in a fully unsupervised manner. The authors develop an efficient GW-based objective, show it can be solved in a single step with minimal tuning, and extend it to large vocabularies via a two-stage scaling approach. Empirical results on standard benchmarks demonstrate competitive performance with substantially lower computational cost and fewer hyper-parameter requirements than state-of-the-art unsupervised methods. The work also provides a geometric, language-distance perspective on embedding spaces, offering interpretable qualitative insights into language relationships.

Abstract

Cross-lingual or cross-domain correspondences play key roles in tasks ranging from machine translation to transfer learning. Recently, purely unsupervised methods operating on monolingual embeddings have become effective alignment tools. Current state-of-the-art methods, however, involve multiple steps, including heuristic post-hoc refinement strategies. In this paper, we cast the correspondence problem directly as an optimal transport (OT) problem, building on the idea that word embeddings arise from metric recovery algorithms. Indeed, we exploit the Gromov-Wasserstein distance that measures how similarities between pairs of words relate across languages. We show that our OT objective can be estimated efficiently, requires little or no tuning, and results in performance comparable with the state-of-the-art in various unsupervised word translation tasks.

Paper Structure

This paper contains 18 sections, 1 theorem, 12 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.1

With the choice $L = L_2$, $\text{GW}^{\frac{1}{2}}$ is a distance on the space of metric measure spaces.
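For reference, the quantity in the theorem can be written out in the discrete setting used for word embeddings. This is a sketch in assumed notation that this summary does not define: $\mathbf{C} \in \mathbb{R}^{n \times n}$ and $\mathbf{C}' \in \mathbb{R}^{m \times m}$ are the intra-language cost matrices, $\mathbf{p}$ and $\mathbf{q}$ are probability vectors over the two vocabularies, and $\Gamma$ is the coupling.

$$
\text{GW}(\mathbf{C}, \mathbf{C}', \mathbf{p}, \mathbf{q}) = \min_{\Gamma \in \Pi(\mathbf{p}, \mathbf{q})} \sum_{i,j,k,l} L(C_{ik}, C'_{jl}) \, \Gamma_{ij} \, \Gamma_{kl}
$$

$$
\Pi(\mathbf{p}, \mathbf{q}) = \left\{ \Gamma \in \mathbb{R}_{+}^{n \times m} : \Gamma \mathbf{1}_m = \mathbf{p}, \ \Gamma^{\top} \mathbf{1}_n = \mathbf{q} \right\}
$$

Here $L_2(a, b)$ is the squared difference $(a - b)^2$, up to a constant factor. Because the loss is squared, it is the square root $\text{GW}^{\frac{1}{2}}$, not $\text{GW}$ itself, that behaves as a distance.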

Figures (4)

  • Figure 1: The Gromov-Wasserstein distance is well suited for the task of cross-lingual alignment because it relies on relational rather than positional similarities to infer correspondences across domains. Computing it requires two intra-domain similarity (or equivalently cost) matrices (left & center), and it produces an optimal coupling of source and target points with minimal discrepancy cost (right).
  • Figure 2: Training dynamics for the Gromov-Wasserstein alignment problem. The algorithm provably makes progress in each iteration, and the objective (red dashed line) closely follows the metric of interest (translation accuracy, not available during training). More related languages (e.g., En$\rightarrow$Fr) lead to faster optimization, while more distant pairs (e.g., En$\rightarrow$Ru) yield slower learning curves.
  • Figure 3: Top: Word embeddings trained on non-comparable corpora can lead to uneven distributions of pairwise distances, as shown here for the En-Fi pair of Dinu et al. (2014). Bottom: Normalizing the cost matrices leads to better optimization and improved performance.
  • Figure 4: Pairwise language Gromov-Wasserstein distances obtained as the minimal transportation cost (the Gromov-Wasserstein objective) between word embedding similarity matrices. Values scaled by $10^{2}$ for easy visualization.
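
The pipeline described in the Figure 1 caption (two intra-domain cost matrices in, one coupling out) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses cosine distances as costs, an entropy-regularized projected-gradient scheme with Sinkhorn iterations as the solver, and toy random vectors in place of real embeddings. The function names, `eps`, and the iteration counts are all hypothetical choices made for this example.

```python
import numpy as np

def cosine_cost(X):
    """Intra-domain cost matrix: pairwise cosine distances between rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.clip(1.0 - Xn @ Xn.T, 0.0, None)

def sinkhorn(cost, p, q, eps, n_iter=200):
    """Entropy-regularized OT coupling with marginals (p, q) for a fixed cost."""
    K = np.exp(-(cost - cost.min()) / eps)  # shift by the min for numerical stability
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]

def gw_objective(Cs, Ct, G):
    """sum_{i,j,k,l} (Cs[i,k] - Ct[j,l])^2 G[i,j] G[k,l], without a 4-way loop."""
    p, q = G.sum(axis=1), G.sum(axis=0)
    const = (Cs**2 @ p)[:, None] + (Ct**2 @ q)[None, :]
    return float(np.sum((const - 2.0 * Cs @ G @ Ct.T) * G))

def gromov_wasserstein(Cs, Ct, p, q, eps=0.05, n_outer=50):
    """Alternate between linearizing the GW objective and a Sinkhorn projection."""
    G = np.outer(p, q)  # start from the independent coupling
    const = (Cs**2 @ p)[:, None] + (Ct**2 @ q)[None, :]
    for _ in range(n_outer):
        pseudo_cost = const - 2.0 * Cs @ G @ Ct.T  # gradient of the objective, up to a factor
        G = sinkhorn(pseudo_cost, p, q, eps)
    return G

# Toy "languages": Y is a rotated, row-permuted copy of X, so the relational
# structure (pairwise cosine distances) is identical up to the permutation.
rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal map
perm = rng.permutation(n)
Y = X[perm] @ Q

Cs, Ct = cosine_cost(X), cosine_cost(Y)
p = np.full(n, 1.0 / n)  # uniform weights over source words
q = np.full(n, 1.0 / n)  # uniform weights over target words

G = gromov_wasserstein(Cs, Ct, p, q)
print("GW objective at the returned coupling:", gw_objective(Cs, Ct, G))
print("argmax target index per source word:", G.argmax(axis=1))
```

Each row of `G` is a soft assignment of one source word to the target vocabulary, and taking the row-wise argmax gives a candidate translation. Because the objective is non-convex, the returned coupling is a local solution and is not guaranteed to recover the true permutation, even on this toy data. The cost normalization mentioned in the Figure 3 caption is not applied here.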

Theorems & Definitions (1)

  • Theorem 3.1 (Mémoli, 2011)