XIFBench: Evaluating Large Language Models on Multilingual Instruction Following

Zhenyu Li, Kehai Chen, Yunfei Long, Xuefeng Bai, Yaoyin Zhang, Xuchen Wei, Juntao Li, Min Zhang

TL;DR

XIFBench introduces a constraint-rich, multilingual benchmark to systematically evaluate instruction-following in LLMs across six languages. It features 558 instructions with 0–5 constraints across five categories and uses English requirements as semantic anchors, complemented by cultural accessibility annotations and translation-validation checks. The study reveals that language resource levels, constraint types, instruction complexity, and cultural specificity shape cross-lingual adherence, with pronounced gaps in low-resource languages and in full instruction following (IFR). The authors provide open-source code and data, and discuss implications for improving multilingual LLM training and evaluation.

Abstract

Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings has not been systematically investigated, and existing evaluations lack fine-grained constraint analysis across diverse linguistic contexts. We introduce XIFBench, a comprehensive constraint-based benchmark for evaluating the multilingual instruction-following abilities of LLMs, comprising 558 instructions with 0–5 additional constraints across five categories (Content, Style, Situation, Format, and Numerical) in six languages spanning different resource levels. To support reliable and consistent cross-lingual evaluation, we implement three methodological innovations: cultural accessibility annotation, constraint-level translation validation, and requirement-based evaluation using English requirements as semantic anchors across languages. Extensive experiments with various LLMs not only quantify performance disparities across resource levels but also provide detailed insights into how language resources, constraint categories, instruction complexity, and cultural specificity influence multilingual instruction-following. Our code and data are available at https://github.com/zhenyuli801/XIFBench.

Paper Structure

This paper contains 47 sections, 4 equations, 7 figures, 22 tables.

Figures (7)

  • Figure 1: Discrepancies in LLMs' instruction-following across languages. Given an English instruction and its Chinese translation, the Llama-3.1-8B responses follow the constraints to differing degrees, as shown by the constraint-based evaluation. A concise back-translation (b.t.) of the response is provided for reference.
  • Figure 2: The automated pipeline for constructing XIFBench, consisting of three stages with six steps: Constraint Augmentation, Requirement Structuring, and Multilingual Expansion. The example shown follows the same instruction as in Figure 1.
  • Figure 3: Cross-lingual RFR performance across constraint categories for three representative models. Each radar chart illustrates the RFR scores across different languages within each constraint category.
  • Figure 4: Cross-lingual IFR performance across instruction complexity levels for three representative models. Each group shows IFR scores across languages per additional constraint count (+xC).
  • Figure 5: Cross-lingual RFR and IFR performance of culturally universal (CUI) and specific (CSI) instructions across three models. Each group presents following rates per type across languages.
  • ...and 2 more figures
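
The figure captions above report RFR and IFR scores without defining them here. As a minimal sketch, assume RFR is the fraction of individual requirements a response satisfies and IFR is the fraction of instructions whose requirements are all satisfied (the TL;DR describes IFR as full instruction following). The function names and the shape of the `judgments` mapping below are hypothetical and are not taken from the XIFBench code.

```python
from typing import Dict, List


def requirement_following_rate(judgments: Dict[str, List[bool]]) -> float:
    """Assumed RFR: share of satisfied requirements, pooled over all instructions."""
    flat = [ok for reqs in judgments.values() for ok in reqs]
    return sum(flat) / len(flat) if flat else 0.0


def instruction_following_rate(judgments: Dict[str, List[bool]]) -> float:
    """Assumed IFR: share of instructions with every requirement satisfied."""
    if not judgments:
        return 0.0
    fully_followed = sum(all(reqs) for reqs in judgments.values())
    return fully_followed / len(judgments)


# Hypothetical per-instruction judgments (True = requirement met).
example = {
    "inst-1": [True, True, True],
    "inst-2": [True, False],
    "inst-3": [True],
}
rfr = requirement_following_rate(example)   # 5 of 6 requirements met
ifr = instruction_following_rate(example)   # 2 of 3 instructions fully followed
print(f"RFR={rfr:.3f} IFR={ifr:.3f}")
```

Under these assumptions a single missed requirement fails the whole instruction for IFR, so IFR can never exceed the share of instructions with at least one satisfied requirement and tends to fall faster than RFR as constraints are added.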