On the Evaluation Practices in Multilingual NLP: Can Machine Translation Offer an Alternative to Human Translations?

Rochelle Choenni, Sara Rajaee, Christof Monz, Ekaterina Shutova

TL;DR

This paper analyzes existing evaluation frameworks in multilingual NLP, discusses their limitations, and proposes several directions for more robust and reliable evaluation practices.

Abstract

While multilingual language models (MLMs) have been trained on 100+ languages, they are typically only evaluated across a handful of them due to a lack of available test data in most languages. This is particularly problematic when assessing MLMs' potential for low-resource and unseen languages. In this paper, we present an analysis of existing evaluation frameworks in multilingual NLP, discuss their limitations, and propose several directions for more robust and reliable evaluation practices. Furthermore, we empirically study to what extent machine translation offers a reliable alternative to human translation for large-scale evaluation of MLMs across a wide set of languages. We use a state-of-the-art (SOTA) translation model to translate test data from 4 tasks into 198 languages and use the translated data to evaluate three MLMs. We show that while the selected subsets of high-resource test languages are generally sufficiently representative of a wider range of high-resource languages, we tend to overestimate MLMs' ability on low-resource languages. Finally, we show that simpler baselines can achieve relatively strong performance without having benefited from large-scale multilingual pretraining.

Paper Structure

This paper contains 49 sections, 17 figures, and 12 tables.

Figures (17)

  • Figure 1: We report the datasets included in each benchmark along with the number of languages that they cover. The datasets are color-coded by type of task: classification, retrieval, question answering, or structured prediction.
  • Figure 2: Number of test languages for each task and the average typological diversity score between them computed as the average cosine similarity between URIEL features of each language pair.
  • Figure 3: Average performance across test languages in a zero-shot fine-tuning setup for XLM-R and in a zero-shot prompting setup for BLOOMz. Results are categorized per task and by data coverage during pretraining, as reported in Table \ref{tab:categorization}. Results across models are not directly comparable because their language categorizations differ.
  • Figure 4: Accuracy of the XLM-R model on the PAWS-X task across 196 languages.
  • Figure 5: Accuracy of the XLM-R model on the XNLI task across 196 languages.
  • ...and 12 more figures
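
The score described in the Figure 2 caption (the average cosine similarity between the URIEL feature vectors of each language pair) can be illustrated with a minimal sketch. The function names, the language labels, and the toy binary vectors below are assumptions made for illustration; they are not actual URIEL features and this is not the authors' implementation.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def average_pairwise_similarity(features):
    """Mean cosine similarity over all unordered language pairs.

    `features` maps a language label to its typological feature vector.
    """
    pairs = list(combinations(sorted(features), 2))
    if not pairs:
        raise ValueError("need at least two languages")
    total = sum(cosine_similarity(features[a], features[b]) for a, b in pairs)
    return total / len(pairs)


# Hypothetical toy vectors (NOT real URIEL features): two identical
# languages and one that shares no active feature with them.
toy_features = {
    "lang_a": [1, 0, 1, 0],
    "lang_b": [1, 0, 1, 0],
    "lang_c": [0, 1, 0, 1],
}

score = average_pairwise_similarity(toy_features)
print(f"average pairwise cosine similarity: {score:.3f}")
```

Under this definition, a lower average similarity indicates a more typologically diverse set of test languages. In the toy example only one of the three pairs is similar, so the average is about one third.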