Table of Contents
Fetching ...

Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages

Kevin Heffernan, Onur Çelebi, Holger Schwenk

TL;DR

The paper addresses the difficulty of scaling multilingual representations to many low-resource languages by adopting a teacher–student framework that yields mutually compatible language-family encoders (LASER3) while preserving a shared embedding space. By combining supervised distillation with self-supervised MLM training and a curriculum-based approach, the method effectively leverages monolingual data and enhances performance beyond the original LASER, particularly on FLORES and African-language benchmarks. It demonstrates that language-specific vocabularies and targeted encoder families significantly improve xsim alignment, and shows that mined bitexts substantially boost NMT BLEU scores for dozens of African languages. Overall, the work advances scalable, high-quality bitext mining and NMT for a broad set of low-resource languages, enabling practical cross-lingual transfer and resources expansion.

Abstract

Scaling multilingual representation learning beyond the hundred most frequent languages is challenging, in particular to cover the long tail of low-resource languages. A promising approach has been to train one-for-all multilingual models capable of cross-lingual transfer, but these models often suffer from insufficient capacity and interference between unrelated languages. Instead, we move away from this approach and focus on training multiple language (family) specific representations, but most prominently enable all languages to still be encoded in the same representational space. To achieve this, we focus on teacher-student training, allowing all encoders to be mutually compatible for bitext mining, and enabling fast learning of new languages. We introduce a new teacher-student training scheme which combines supervised and self-supervised training, allowing encoders to take advantage of monolingual training data, which is valuable in the low-resource setting. Our approach significantly outperforms the original LASER encoder. We study very low-resource languages and handle 50 African languages, many of which are not covered by any other model. For these languages, we train sentence encoders, mine bitexts, and validate the bitexts by training NMT systems.

Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages

TL;DR

The paper addresses the difficulty of scaling multilingual representations to many low-resource languages by adopting a teacher–student framework that yields mutually compatible language-family encoders (LASER3) while preserving a shared embedding space. By combining supervised distillation with self-supervised MLM training and a curriculum-based approach, the method effectively leverages monolingual data and enhances performance beyond the original LASER, particularly on FLORES and African-language benchmarks. It demonstrates that language-specific vocabularies and targeted encoder families significantly improve xsim alignment, and shows that mined bitexts substantially boost NMT BLEU scores for dozens of African languages. Overall, the work advances scalable, high-quality bitext mining and NMT for a broad set of low-resource languages, enabling practical cross-lingual transfer and resources expansion.

Abstract

Scaling multilingual representation learning beyond the hundred most frequent languages is challenging, in particular to cover the long tail of low-resource languages. A promising approach has been to train one-for-all multilingual models capable of cross-lingual transfer, but these models often suffer from insufficient capacity and interference between unrelated languages. Instead, we move away from this approach and focus on training multiple language (family) specific representations, but most prominently enable all languages to still be encoded in the same representational space. To achieve this, we focus on teacher-student training, allowing all encoders to be mutually compatible for bitext mining, and enabling fast learning of new languages. We introduce a new teacher-student training scheme which combines supervised and self-supervised training, allowing encoders to take advantage of monolingual training data, which is valuable in the low-resource setting. Our approach significantly outperforms the original LASER encoder. We study very low-resource languages and handle 50 African languages, many of which are not covered by any other model. For these languages, we train sentence encoders, mine bitexts, and validate the bitexts by training NMT systems.
Paper Structure (22 sections, 1 equation, 1 figure, 5 tables)