Does mBERT understand Romansh? Evaluating word embeddings using word alignment

Eyal Liron Dolev

TL;DR

This paper investigates whether embeddings from multilingual language models (MLMs), especially mBERT, can support word alignment for Romansh in a zero-shot German–Romansh setting. It compares similarity-based aligners (SimAlign, awesome-align) using MLM embeddings against traditional statistical models on the DERMIT corpus, and introduces a German–Romansh gold standard for evaluation. mBERT-based alignment achieves an AER of 0.22, which outperforms fast_align and matches reported performance on seen language pairs; fine-tuning on parallel data lowers the AER further to 0.09. The work provides a new trilingual resource (DERMIT) and demonstrates the practical potential of MLMs for processing Romansh, a historically under-resourced language, with implications for cross-lingual NLP development.
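To make the idea of similarity-based alignment concrete, the following is a minimal sketch of mutual-argmax alignment over a cosine-similarity matrix, the basic principle behind aligners of this kind. It is an illustration under stated assumptions, not the implementation used in the paper: the function names are hypothetical, the two-dimensional vectors are toy stand-ins for contextual embeddings, and real aligners work on subword embeddings and offer further extraction methods.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_argmax_align(src_vecs, tgt_vecs):
    """Link source word i and target word j only if each is the other's best match.

    Hypothetical helper for illustration; inputs are lists of word vectors.
    """
    if not src_vecs or not tgt_vecs:
        return set()
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    links = set()
    for i, row in enumerate(sim):
        # Best target word for source word i.
        j = max(range(len(row)), key=row.__getitem__)
        # Best source word for that target word.
        best_i = max(range(len(sim)), key=lambda k: sim[k][j])
        if best_i == i:
            links.add((i, j))
    return links


# Toy vectors (assumed, not real mBERT output): three source words, two target words.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9]]
links = mutual_argmax_align(src, tgt)
```

In this toy case the third source word has no mutual best match and stays unaligned, which mirrors how function words can be left without a link.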

Abstract

We test similarity-based word alignment models (SimAlign and awesome-align) in combination with word embeddings from mBERT and XLM-R on parallel sentences in German and Romansh. Since Romansh is an unseen language, we are dealing with a zero-shot setting. Using embeddings from mBERT, both models reach an alignment error rate of 0.22, which outperforms fast_align, a statistical model, and is on par with similarity-based word alignment for seen languages. We interpret these results as evidence that mBERT contains information that can be meaningful and applicable to Romansh. To evaluate performance, we also present a new trilingual corpus, which we call the DERMIT (DE-RM-IT) corpus, containing press releases made by the Canton of Grisons in German, Romansh and Italian in the past 25 years. The corpus contains 4,547 parallel documents and approximately 100,000 sentence pairs in each language combination. We additionally present a gold standard for German–Romansh word alignment. The data is available at https://github.com/eyldlv/DERMIT-Corpus.
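For readers unfamiliar with the metric, the sketch below computes the alignment error rate in its standard formulation, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of predicted links, S the sure gold links and P the possible gold links (with S contained in P). The function name and the toy links are assumptions for illustration; whether the paper's gold standard distinguishes sure from possible links is not stated in this section.

```python
def alignment_error_rate(predicted, sure, possible=()):
    """Return AER for sets of (source_index, target_index) links.

    Lower is better; 0.0 means every predicted link is allowed by the
    gold standard and every sure link was found.
    """
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s  # sure links also count as possible
    if not a and not s:
        return 0.0  # nothing to align and nothing predicted
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Toy example (assumed data): three sure links, one possible link.
sure = {(0, 0), (1, 1), (2, 2)}
possible = {(3, 2)}
predicted = {(0, 0), (1, 1), (3, 2), (3, 3)}
aer = alignment_error_rate(predicted, sure, possible)
```

Here two of the three sure links are recovered, one predicted link is merely possible and one is wrong, so the score is 1 - 5/7.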

Paper Structure (29 sections, 2 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Aligning the German compounds Webseite ("website") and Brandversicherung ("fire insurance") to Romansh noun phrases. Only lexemes are aligned with each other. Romansh prepositions are left unaligned.
  • Figure 2: Alignment of German preterite to Romansh perfect. The German word war is translated to Romansh è stà. Nonetheless, è is left unaligned since it only carries grammatical information (tense, number), but no lexical information.
  • Figure 3: The German verb zurückweisen ("reject, decline"), here separated into two words since it is used as the finite verb in the main clause, corresponds to the Romansh verb renviar. This results in a 2-to-1 alignment.
  • Figure 4: Performance of our baseline statistical models in relation to dataset size.
  • Figure 5: Comparison of AER between the three systems (lower is better). The performance of fast_align and eflomal improves with more data. The performance of SimAlign and awesome-align does not depend on dataset size.
  • ...and 4 more figures