MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation

Yile Liu, Ziwei Ma, Xiu Jiang, Jinglu Hu, Jing Chang, Liang Li

TL;DR

MaXIFE introduces a scalable multilingual and cross-lingual instruction-following evaluation benchmark spanning 23 languages with 1,667 verifiable tasks, combining rule-based and model-based assessment to enable cross-language comparisons. It constructs the dataset through surveys, model-assisted expansion, and rigorous translation quality control to produce parallel prompts across languages. The study reports heterogeneous results across languages and models, with top systems excelling in high-resource languages and facing substantial gaps in low-resource ones, driven by data availability and cross-lingual transfer. The benchmark is designed to be extensible, facilitating future multilingual alignment research and broader accessibility of LLM capabilities.

Abstract

With the rapid adoption of large language models (LLMs) in natural language processing, the ability to follow instructions has emerged as a key metric for evaluating their practical utility. However, existing evaluation methods often focus on single-language scenarios, overlooking the challenges and differences present in multilingual and cross-lingual contexts. To address this gap, we introduce MaXIFE: a comprehensive evaluation benchmark designed to assess instruction-following capabilities across 23 different languages with 1667 verifiable instruction tasks. MaXIFE integrates both Rule-Based Evaluation and Model-Based Evaluation, ensuring a balance of efficiency and accuracy. We applied MaXIFE to evaluate several leading commercial LLMs, establishing baseline results for future comparisons. By providing a standardized tool for multilingual instruction-following evaluation, MaXIFE aims to advance research and development in natural language processing.

Paper Structure

This paper contains 45 sections, 4 figures, 56 tables.

Figures (4)

  • Figure 1: LLMs have different instruction-following capabilities across 3 different languages: English, Malay and Zulu.
  • Figure 2: MaXIFE Structure, its evaluation dataset composition, and evaluation strategy. We provide 795 Basic Questions and 1667 Instructions, where each Basic Question is combined with 1-3 Instructions to form one piece of evaluation data. In the translation phase, we established human translation processes for both Questions and Instructions to verify quality and ensure accuracy. The translation of Questions focuses more on authenticity of expression, while the translation of Instructions emphasizes precision and rigor in word choice, as well as the accuracy of terminology in specific contexts within particular languages.
  • Figure 3: 11 Instruction Categories and 47 Instruction Subcategories.
  • Figure 4: Data Example of Cross-lingual experiment.