M-IFEval: Multilingual Instruction-Following Evaluation

Antoine Dussolle, Andrea Cardeña Díaz, Shota Sato, Peter Devine

TL;DR

English-centric benchmarks for instruction following limit understanding of multilingual LLM capabilities. M-IFEval extends IFEval with French, Japanese, and Spanish, incorporating language-specific instructions and an objective evaluation framework, and evaluates 8 state-of-the-art LLMs with greedy decoding. Results show language- and task-dependent performance, with scripts and diacritics posing notable challenges and no model uniformly excelling across all languages. The work provides a publicly available multilingual benchmark and baselines to guide model development and selection for non-English tasks.

Abstract

Instruction following is a core capability of modern large language models (LLMs), so evaluating it is essential to understanding these models. The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria, offering a measure of LLM performance without subjective AI or human judgement. However, it only includes English instructions, limiting its ability to assess LLMs in other languages. We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions. Applying this benchmark to 8 state-of-the-art LLMs, we find that benchmark performance across languages and instruction types can vary widely, underscoring the importance of a multilingual benchmark for evaluating LLMs in a diverse cultural context.

Paper Structure

This paper contains 15 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Task Diversity Analysis: Percentage distribution of prompts among instruction groups by language.
  • Figure 2: Instruction following strict-accuracy per instruction group: Spanish (ES).
  • Figure 3: Instruction following strict-accuracy per instruction group: French (FR).
  • Figure 4: Instruction following strict-accuracy per instruction group: Japanese (JA).
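
The per-group strict accuracy reported in Figures 2 to 4 can be read as the fraction of instructions in each group that a model's response satisfies under strict checking. The following is a minimal sketch of that aggregation, not the authors' code: the function name `strict_accuracy_by_group`, the `(instruction_id, followed)` input shape, and the "group:name" identifier convention are assumptions made for illustration.

```python
from collections import defaultdict


def strict_accuracy_by_group(results):
    """Aggregate pass/fail outcomes into per-group accuracy.

    `results` is an iterable of (instruction_id, followed) pairs.
    Assumption: instruction_id has the form "group:name", and `followed`
    is True when the response passed the strict check for that instruction.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for instruction_id, followed in results:
        # Everything before the first colon is treated as the group.
        group = instruction_id.split(":", 1)[0]
        totals[group] += 1
        passed[group] += int(followed)
    return {group: passed[group] / totals[group] for group in totals}


# Hypothetical outcomes for illustration only; not results from the paper.
example = [
    ("punctuation:no_comma", True),
    ("punctuation:no_comma", False),
    ("length_constraints:number_words", True),
]
accuracy = strict_accuracy_by_group(example)
print(accuracy)
```

Running the same aggregation separately on the outcomes for each language would give one value per instruction group and language, which is the granularity the three figures use.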