MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation
Yile Liu, Ziwei Ma, Xiu Jiang, Jinglu Hu, Jing Chang, Liang Li
TL;DR
MaXIFE introduces a scalable multilingual and cross-lingual instruction-following evaluation benchmark spanning 23 languages with 1,667 verifiable tasks, combining rule-based and model-based assessment to enable cross-language comparisons. It constructs the dataset through surveys, model-assisted expansion, and rigorous translation quality control to produce parallel prompts across languages. The study reports heterogeneous results across languages and models, with top systems excelling in high-resource languages and facing substantial gaps in low-resource ones, driven by data availability and cross-lingual transfer. The benchmark is designed to be extensible, facilitating future multilingual alignment research and broader accessibility of LLM capabilities.
Abstract
With the rapid adoption of large language models (LLMs) in natural language processing, the ability to follow instructions has emerged as a key metric for evaluating their practical utility. However, existing evaluation methods often focus on single-language scenarios, overlooking the challenges and differences present in multilingual and cross-lingual contexts. To address this gap, we introduce MaXIFE: a comprehensive evaluation benchmark designed to assess instruction-following capabilities across 23 languages with 1,667 verifiable instruction tasks. MaXIFE integrates both Rule-Based Evaluation and Model-Based Evaluation to balance efficiency and accuracy. We applied MaXIFE to evaluate several leading commercial LLMs, establishing baseline results for future comparisons. By providing a standardized tool for multilingual instruction-following evaluation, MaXIFE aims to advance research and development in natural language processing.
