Multi-lingual Functional Evaluation for Large Language Models

Victor Ojewale, Inioluwa Deborah Raji, Suresh Venkatasubramanian

Abstract

Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create two multi-lingual functional benchmarks, Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval), by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others: across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in English, French and Spanish respectively; similarly, there is a 15-24% performance drop across languages between Belebele and CL-IFEval, but only a 0.5% to 3% performance drop between M-MMLU and CL-IFEval. We also find that model robustness across languages varies significantly, with certain languages (e.g., Arabic and English) performing most consistently well across evaluation iterations.

Paper Structure

This paper contains 38 sections, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Description of the functional evaluation paradigm. Unlike static data benchmarks, the functional evaluation paradigm does not fix model input prompts; it generates them from a fixed template, a set of variables $X$ (modifiable prompt attributes meant to impact model outputs) and a set of distractors $D$ (modifiable prompt attributes meant to be ignored). The ground truth in this setting is generated through a fixed functional transformation $f(X)$. For instance, the prompt "Sally bought 2 red apples and 3 green apples. How much fruit did Sally buy?" is generated from the fixed template "$\{name\}$ bought $\{n_1\}$ $\{color_1\}$ apples and $\{n_2\}$ $\{color_2\}$ apples. How much fruit did $\{name\}$ buy?". This template involves the variables $X = \{n_1, n_2\}$ and the distractors $D = \{name, color_1, color_2\}$. The correct fixed output function in this case is $f(X) = n_1 + n_2$.
  • Figure 2: Correlation plot of the performance gap between MGSM, MMLU, Belebele (left to right) and CL-IFEval for high-resource languages only (en, fr, es). The plot shows that measured language performance gaps (i.e., the difference between performance on the best-performing language and the worst-performing language) are notably larger in functional evaluations than in static data benchmarks.
  • Figure 3: Cross-Lingual IFEval Aya-23-35B Model Comparison Across Languages
  • Figure 4: Cross-Lingual IFEval Gemma-2-9b-it Model Comparison Across Languages
  • Figure 5: Cross-Lingual IFEval Qwen3-8b Model Comparison Across Languages
  • ...and 3 more figures
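
The functional evaluation paradigm described in the Figure 1 caption can be illustrated with a minimal Python sketch. The template, the variables $X$, the distractors $D$ and the function $f(X) = n_1 + n_2$ come from the caption; the value pools, the sampling ranges and all function names are hypothetical and are not taken from the paper's implementation. The `performance_gap` helper follows the definition given in the Figure 2 caption (best-performing language minus worst-performing language).

```python
import random

# Template from the Figure 1 caption, written with Python format fields.
TEMPLATE = (
    "{name} bought {n_1} {color_1} apples and {n_2} {color_2} apples. "
    "How much fruit did {name} buy?"
)

# Hypothetical value pools; the source does not list the actual ones.
NAMES = ["Sally", "Tunde", "Amira"]
COLORS = ["red", "green", "yellow"]


def ground_truth(variables):
    """f(X) = n_1 + n_2; the distractors D never enter this function."""
    return variables["n_1"] + variables["n_2"]


def generate_instance(rng):
    """Sample variables X and distractors D, then fill the fixed template."""
    # Assumed sampling range of 1..20 for both counts.
    variables = {"n_1": rng.randint(1, 20), "n_2": rng.randint(1, 20)}
    distractors = {
        "name": rng.choice(NAMES),
        "color_1": rng.choice(COLORS),
        "color_2": rng.choice(COLORS),
    }
    prompt = TEMPLATE.format(**variables, **distractors)
    return prompt, ground_truth(variables)


def performance_gap(scores_by_language):
    """Best-performing language score minus worst-performing language score."""
    return max(scores_by_language.values()) - min(scores_by_language.values())


# The caption's own example: X = {n_1: 2, n_2: 3}.
print(ground_truth({"n_1": 2, "n_2": 3}))  # -> 5

# A freshly sampled instance; the seed is arbitrary.
prompt, answer = generate_instance(random.Random(0))
print(prompt, "->", answer)
```

Because the ground truth depends only on $X$, resampling the distractors should leave the correct answer unchanged, which is what allows repeated evaluation iterations to probe robustness rather than memorization of one fixed prompt.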