The Roles of English in Evaluating Multilingual Language Models

Wessel Poelman, Miryam de Lhoneux

TL;DR

The paper investigates how English is used in multilingual LM evaluations, arguing that English often serves as an interface to boost task performance rather than as a natural language for assessing language understanding. It distinguishes two evaluation goals—task performance and multilingual natural language understanding (MLU)—and critiques mixed-prompt approaches that conflate these roles and introduce confounds such as code-switching and script-switching. Through a survey of prompting setups and examples, the authors show that English-as-interface can leak signals and inflate results, advocating for native or true code-switched prompts that better reflect language-specific understanding and instruction-following. The work calls for a shift toward language-understanding-centered evaluation, with implications for benchmark design, data collection, and interpretability of multilingual language models.

Abstract

Multilingual natural language processing is receiving increased attention, with numerous models, benchmarks, and methods being released for many languages. English is often used in multilingual evaluation to prompt language models (LMs), mainly to overcome the lack of instruction tuning data in other languages. In this position paper, we lay out two roles of English in multilingual LM evaluations: as an interface and as a natural language. We argue that these roles have different goals: task performance versus language understanding. This discrepancy is highlighted with examples from datasets and evaluation setups. Numerous works explicitly use English as an interface to boost task performance. We recommend moving away from this imprecise method and instead focusing on furthering language understanding.

Paper Structure

This paper contains 10 sections and 1 figure.

Figures (1)

  • Figure 1: Schematic overview of the different roles of English in multilingual LM evaluation.