TransAlign: Machine Translation Encoders are Strong Word Aligners, Too

Benedikt Ebing, Christian Goldschmied, Goran Glavaš

TL;DR

TransAlign introduces a word aligner that leverages the encoder of a massively multilingual MT model to produce cross-language token alignments for translation-based XLT. By computing a token similarity matrix $S_{xy}=h_xh_y^\top$, applying bidirectional softmax normalization to obtain $\hat{S}_{xy}$ and $\hat{S}_{yx}$, and enforcing a mutual-threshold condition, TransAlign derives alignments $A_{ij}$ that are then used for label projection; it can be further refined with a WA-specific fine-tuning objective. Empirically, TransAlign outperforms strong WA baselines and a non-WA label projection method (Codec) on 28 languages for NER and slot labeling, and shows robust intrinsic word-alignment performance with a focus on semantically meaningful content words. The results imply MT encoders can serve as strong, efficient partners for WA, improving translation-based cross-lingual token classification and offering a practical alternative to decoding-based projection.
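
The alignment extraction summarized above can be illustrated with a minimal sketch. It assumes the token representations $h_x$ and $h_y$ are already available as lists of vectors, that the softmax is taken over the target tokens for $\hat{S}_{xy}$ and over the source tokens for $\hat{S}_{yx}$, and that a pair $(i, j)$ is kept only when both normalized scores exceed a threshold $c$. The function names and the threshold value are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def extract_alignments(h_x, h_y, c):
    """Return the set of (i, j) pairs whose scores pass the threshold c
    in both directions (hypothetical helper, for illustration only)."""
    # S_xy[i][j] is the dot product of source token i and target token j.
    s_xy = [[sum(a * b for a, b in zip(u, v)) for v in h_y] for u in h_x]
    # S_yx is the transpose of S_xy.
    s_yx = [list(col) for col in zip(*s_xy)]
    # Normalize each direction row-wise.
    p_xy = [softmax(row) for row in s_xy]  # over target tokens
    p_yx = [softmax(row) for row in s_yx]  # over source tokens
    return {
        (i, j)
        for i in range(len(h_x))
        for j in range(len(h_y))
        if p_xy[i][j] > c and p_yx[j][i] > c
    }


# Toy example: source token 0 resembles target token 1, and vice versa.
h_x = [[4.0, 0.0], [0.0, 4.0]]
h_y = [[0.0, 4.0], [4.0, 0.0]]
alignments = extract_alignments(h_x, h_y, c=0.5)
```

In practice the vectors would come from a chosen encoder layer of the MT model, and $c$ would be tuned; the toy value of 0.5 is only for demonstration.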

Abstract

In the absence of sizable training data for most world languages and NLP tasks, translation-based strategies such as translate-test -- evaluating on noisy source language data translated from the target language -- and translate-train -- training on noisy target language data translated from the source language -- have been established as competitive approaches for cross-lingual transfer (XLT). For token classification tasks, these strategies require label projection: mapping the labels from each token in the original sentence to its counterpart(s) in the translation. To this end, it is common to leverage multilingual word aligners (WAs) derived from encoder language models such as mBERT or LaBSE. Despite obvious associations between machine translation (MT) and WA, research on extracting alignments with MT models is largely limited to exploiting cross-attention in encoder-decoder architectures, yielding poor WA results. In this work, in contrast, we propose TransAlign, a novel word aligner that utilizes the encoder of a massively multilingual MT model. We show that TransAlign not only achieves strong WA performance but substantially outperforms popular WA and state-of-the-art non-WA-based label projection methods in MT-based XLT for token classification.
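
To make the label projection step concrete, here is a simplified sketch that copies token-level labels from the original sentence onto the translation through a set of word alignments. The function name, the `"O"` default for unlabeled tokens, and the first-match tie-breaking rule are assumptions for illustration; practical projection pipelines typically add further handling, such as keeping labeled spans contiguous.

```python
def project_labels(src_labels, alignments, tgt_len, outside="O"):
    """Project labels from source tokens to target tokens.

    src_labels: one label per source token.
    alignments: set of (source_index, target_index) pairs.
    tgt_len:    number of tokens in the translation.
    """
    tgt_labels = [outside] * tgt_len
    # Iterate in sorted order so the result is deterministic.
    for i, j in sorted(alignments):
        # Only non-outside labels are projected; the first label wins
        # when several source tokens align to the same target token.
        if src_labels[i] != outside and tgt_labels[j] == outside:
            tgt_labels[j] = src_labels[i]
    return tgt_labels


src_labels = ["B-PER", "O", "B-LOC"]
alignments = {(0, 1), (1, 0), (2, 2)}
projected = project_labels(src_labels, alignments, tgt_len=3)
```

Target tokens that have no aligned source token keep the `outside` label, which is one reason alignment quality directly affects downstream token classification.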

Paper Structure

This paper contains 19 sections, 4 equations, 3 figures, 17 tables.

Figures (3)

  • Figure 1: Word alignment performance across layers of vanilla TransAlign. We present the average AER over all 8 language pairs.
  • Figure 2: Word alignment performance for different values of the threshold $c$. We evaluate vanilla WAs and present the average AER over all 8 language pairs.
  • Figure 3: Variance of WA model fine-tuning with three distinct random seeds evaluated on translation-based XLT. Results with DeBERTa.
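
Figures 1 and 2 report alignment error rate (AER), where lower is better. As a reminder, the sketch below uses the standard definition: for predicted alignments $A$, sure gold alignments $S$, and possible gold alignments $P$ (with $S \subseteq P$), $\text{AER} = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}$. The function name and the toy data are illustrative assumptions.

```python
def alignment_error_rate(predicted, sure, possible):
    """Compute AER from sets of (source_index, target_index) pairs."""
    # Sure links count as possible links by convention.
    possible = possible | sure
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        raise ValueError("AER is undefined when both sets are empty")
    matched = len(predicted & sure) + len(predicted & possible)
    return 1 - matched / denominator


perfect = alignment_error_rate({(0, 0), (1, 1)}, sure={(0, 0)}, possible={(1, 1)})
wrong = alignment_error_rate({(0, 1)}, sure={(0, 0)}, possible=set())
```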