Using Machine Translation to Augment Multilingual Classification

Adam King

TL;DR

The paper tackles the data bottleneck in multilingual text classification by using machine translation to turn annotations in a single language into labeled data for multiple languages. It introduces an original-translated contrastive (OTC) loss, inspired by contrastive learning for image captioning, that aligns the embeddings of sentences across languages during fine-tuning of a multilingual transformer. Experiments on a six-language Amazon reviews dataset translated with M2M100 show that training with translated data yields consistent F1-micro gains (roughly 0.02–0.11), with OTC providing additional improvements and a statistically significant positive effect. The findings suggest that MT-augmented data can enable rapid expansion to new languages and that the OTC loss helps bridge translation-induced gaps, which is of practical value for multilingual NLP deployment.

Abstract

An all-too-present bottleneck for text classification model development is the need to annotate training data, and this need is multiplied for multilingual classifiers. Fortunately, contemporary machine translation models are both easily accessible and have dependable translation quality, making it possible to translate labeled training data from one language into another. Here, we explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages. We also investigate the benefits of using a novel technique, originally proposed in the field of image captioning, to account for potential negative effects of tuning models on translated data. We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique is able to offer some improvement over models tuned without it.
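
To make the idea of an original-translated contrastive loss more concrete, the following is a minimal sketch of a symmetric, image-captioning-style contrastive objective applied to sentence embeddings. It is not the paper's exact formulation, which is not given in this summary: the cosine similarity, the `temperature` value, the symmetric averaging, and the function name `otc_style_loss` are all assumptions made for illustration. Each original sentence is treated as the positive match for its own translation, and every other translation in the batch is treated as a negative.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def otc_style_loss(orig_emb, trans_emb, temperature=0.07):
    """Hypothetical contrastive loss between original and translated embeddings.

    orig_emb[i] and trans_emb[i] are assumed to embed the same review,
    once in the original language and once after translation.
    """
    orig = [_normalize(v) for v in orig_emb]
    trans = [_normalize(v) for v in trans_emb]
    # Similarity of every original sentence to every translated sentence.
    logits = [
        [sum(a * b for a, b in zip(o, t)) / temperature for t in trans]
        for o in orig
    ]
    # Transpose to score translated -> original as well.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))
```

In a fine-tuning setup, a term like this would typically be added to the ordinary classification loss with some weighting; how the two are combined in the paper is not stated here, so that choice is left open.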

Paper Structure

This paper contains 9 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Example original and translated data. Each unique review (id) in the original dataset was translated to the other languages and assigned the same star value. Texts truncated here for formatting.
  • Figure 2: Full model details for the MLE model trained to predict F1-micro per language. OTC contributes positively to the F1-micro score, even when controlling for variance between languages and model runs.
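
The data construction described in the Figure 1 caption (translate each review into the other languages and carry over its star value) can be sketched as below. This is an illustrative outline only: the record fields `text`, `lang`, and `translated`, the function name `augment_with_translations`, and the `translate` callable are hypothetical. In practice `translate` would wrap a machine translation model such as M2M100; here it is left as a parameter so the sketch stays self-contained.

```python
from typing import Callable, Dict, List


def augment_with_translations(
    reviews: List[Dict],
    source_lang: str,
    target_langs: List[str],
    translate: Callable[[str, str, str], str],
) -> List[Dict]:
    """Return the original reviews plus one translated copy per target language.

    Each review is assumed to be a dict with "id", "text", and "stars" keys.
    translate(text, source_lang, target_lang) is a caller-supplied function.
    """
    rows = []
    for review in reviews:
        # Keep the original, annotated example.
        rows.append({
            "id": review["id"],
            "lang": source_lang,
            "text": review["text"],
            "stars": review["stars"],
            "translated": False,
        })
        for lang in target_langs:
            # The translated copy inherits the id and the star label unchanged.
            rows.append({
                "id": review["id"],
                "lang": lang,
                "text": translate(review["text"], source_lang, lang),
                "stars": review["stars"],
                "translated": True,
            })
    return rows
```

Keeping the shared `id` on every row makes it straightforward to pair each original sentence with its translations later, for example when building batches for a contrastive objective.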