Beyond Shared Vocabulary: Increasing Representational Word Similarities across Languages for Multilingual Machine Translation

Di Wu, Christof Monz

TL;DR

The paper tackles the limited word-level transfer in multilingual MT when shared vocabularies span languages with divergent scripts. It introduces word equivalence graphs built from bilingual alignments and uses a graph neural network to reparameterize the embedding table, with English serving as a pivot to enable multi-hop cross-lingual information flow. Across IWSLT14 and EC30, the proposed method, GraphMerge, yields consistent translation gains (up to about 2.3 BLEU on average) with minimal train-time and memory overhead, and it keeps inference latency identical to the baseline by deploying the reparameterized embeddings. The approach scales to large language sets and also benefits bilingual translation, demonstrating a practical path to improved multilinguality without heavy architectural changes. Overall, the work provides a principled method to strengthen cross-lingual representations by explicitly modeling word-level equivalences and propagating their influence through a graph-structured prior into embeddings.

Abstract

Using a vocabulary that is shared across languages is common practice in Multilingual Neural Machine Translation (MNMT). In addition to its simple design, shared tokens play an important role in positive knowledge transfer, assuming that shared tokens refer to similar meanings across languages. However, when word overlap is small, especially due to different writing systems, transfer is inhibited. In this paper, we define word-level information transfer pathways via word equivalence classes and rely on graph networks to fuse word embeddings across languages. Our experiments demonstrate the advantages of our approach: 1) embeddings of words with similar meanings are better aligned across languages, 2) our method achieves consistent BLEU improvements of up to 2.3 points for high- and low-resource MNMT, and 3) less than 1.0% additional trainable parameters are required with a limited increase in computational costs, while inference time remains identical to the baseline. We release the codebase to the community.

Paper Structure

This paper contains 31 sections, 5 equations, 3 figures, 15 tables.

Figures (3)

  • Figure 1: bicycle and fiets have the same meaning, but use different forms, potentially leading to a larger distance $\delta$ between their embeddings ($\mathrm{E}[\cdot]$). Our graph-based module $G$ explicitly reparameterizes the word embeddings ($\mathrm{E}_G[\cdot]$) leading to a reduced distance $\delta'$.
  • Figure 2: Illustration of our framework. The left part denotes the subgraphs we build for each language pair, e.g., EN-DE and EN-NL, which are further merged into a multilingual graph. Since we only rely on English-centric data, the graph is sparse, and only four (of the nine possible) sub-matrices are filled. As shown in the right part, the information from the original embeddings (in grey) is transferred along the pathways defined in the graph and converges into the reparameterized embeddings (in blue), which are then used by a standard encoder-decoder model. All parameters, including embeddings, are trained from scratch.
  • Figure 3: Zero-shot performance on EC30 (870 language directions), grouped by High-, Medium-, and Low-resource.
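
The graph construction in the Figure 2 caption and the distance reduction in Figure 1 can be illustrated with a small NumPy sketch. It is not the paper's implementation. Apart from bicycle and fiets, the toy vocabulary is hypothetical, the initial embeddings are one-hot so the numbers can be checked by hand, and the learned graph network is replaced by parameter-free mean aggregation over neighbours with added self-loops. The function names `filled_blocks`, `propagate`, and `dist` are illustrative only.

```python
import numpy as np

# Toy shared vocabulary (hypothetical apart from "bicycle" and "fiets").
vocab = ["bicycle", "house", "Fahrrad", "Haus", "fiets", "huis"]
lang = {"EN": [0, 1], "DE": [2, 3], "NL": [4, 5]}
idx = {w: i for i, w in enumerate(vocab)}

# English-centric word-equivalence pairs, standing in for bilingual alignments.
pairs = [("bicycle", "Fahrrad"), ("house", "Haus"),
         ("bicycle", "fiets"), ("house", "huis")]

# Merge the EN-DE and EN-NL subgraphs into one symmetric adjacency matrix.
n = len(vocab)
A = np.zeros((n, n))
for a, b in pairs:
    A[idx[a], idx[b]] = 1.0
    A[idx[b], idx[a]] = 1.0


def filled_blocks(M):
    """Return the (source, target) language blocks that contain any edge."""
    return sorted((s, t) for s in lang for t in lang
                  if M[np.ix_(lang[s], lang[t])].any())


def propagate(E, A, hops):
    """Assumed simplification: mean aggregation, no learned weights."""
    A_hat = A + np.eye(len(A))                         # self-loops
    A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)  # row-normalize
    for _ in range(hops):
        E = A_norm @ E
    return E


def dist(M, u, v):
    """Euclidean distance between the embeddings of words u and v."""
    return float(np.linalg.norm(M[idx[u]] - M[idx[v]]))


E = np.eye(n)                    # one-hot initial embeddings (assumption)
E_G = propagate(E, A, hops=2)    # two hops: DE <-> EN <-> NL

print(filled_blocks(A))
print(dist(E, "Fahrrad", "fiets"), dist(E_G, "Fahrrad", "fiets"))
```

In this toy setup only the EN-DE, DE-EN, EN-NL, and NL-EN blocks contain edges, matching the four filled sub-matrices in the caption. The German and Dutch words share no direct edge, yet after two hops through the English pivot their embeddings move closer together.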