XIFBench: Evaluating Large Language Models on Multilingual Instruction Following

Zhenyu Li; Kehai Chen; Yunfei Long; Xuefeng Bai; Yaoyin Zhang; Xuchen Wei; Juntao Li; Min Zhang

TL;DR

XIFBench is a constraint-based, multilingual benchmark for systematically evaluating instruction following in LLMs across six languages of differing resource levels. It comprises 558 instructions, each with 0–5 additional constraints drawn from five categories (Content, Style, Situation, Format, and Numerical), and evaluates responses against English requirements that serve as semantic anchors across languages, supported by cultural accessibility annotation and constraint-level translation validation. The study finds that language resource level, constraint category, instruction complexity, and cultural specificity all shape cross-lingual adherence, with pronounced gaps for low-resource languages and in full instruction following (IFR). The authors release their code and data and discuss implications for multilingual LLM training and evaluation.

Abstract

Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings lacks systematic investigation, with existing evaluations lacking fine-grained constraint analysis across diverse linguistic contexts. We introduce XIFBench, a comprehensive constraint-based benchmark for evaluating multilingual instruction-following abilities of LLMs, comprising 558 instructions with 0-5 additional constraints across five categories (Content, Style, Situation, Format, and Numerical) in six languages spanning different resource levels. To support reliable and consistent cross-lingual evaluation, we implement three methodological innovations: cultural accessibility annotation, constraint-level translation validation, and requirement-based evaluation using English requirements as semantic anchors across languages. Extensive experiments with various LLMs not only quantify performance disparities across resource levels but also provide detailed insights into how language resources, constraint categories, instruction complexity, and cultural specificity influence multilingual instruction-following. Our code and data are available at https://github.com/zhenyuli801/XIFBench.
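As a rough illustration of what requirement-based evaluation could look like, the sketch below assumes each instruction has been decomposed into yes/no requirements written in English, and that a judge has already produced one boolean verdict per requirement for each response. The `JudgedResponse` structure, the function names, and the two aggregate scores are hypothetical and chosen for illustration; they are not taken from the XIFBench code. The language code `"xx"` is a placeholder, not one of the benchmark's languages.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JudgedResponse:
    """One model response with per-requirement verdicts (hypothetical structure)."""
    language: str          # language the instruction was posed in
    verdicts: List[bool]   # one entry per English requirement; True = satisfied


def requirement_rate(responses: List[JudgedResponse]) -> float:
    """Fraction of individual requirements satisfied, pooled over all responses."""
    total = sum(len(r.verdicts) for r in responses)
    met = sum(sum(r.verdicts) for r in responses)
    return met / total if total else 0.0


def full_following_rate(responses: List[JudgedResponse]) -> float:
    """Fraction of responses that satisfy every one of their requirements."""
    if not responses:
        return 0.0
    # Note: all([]) is True, so a response with no requirements counts as followed.
    return sum(all(r.verdicts) for r in responses) / len(responses)


def scores_by_language(responses: List[JudgedResponse]) -> Dict[str, Dict[str, float]]:
    """Group responses by language and compute both scores per group."""
    groups: Dict[str, List[JudgedResponse]] = {}
    for r in responses:
        groups.setdefault(r.language, []).append(r)
    return {
        lang: {
            "requirement_rate": requirement_rate(group),
            "full_following_rate": full_following_rate(group),
        }
        for lang, group in groups.items()
    }


# Toy data: the same English requirements anchor verdicts in every language.
sample = [
    JudgedResponse("en", [True, True]),
    JudgedResponse("en", [True, False]),
    JudgedResponse("xx", [False, False, True]),
]
per_language = scores_by_language(sample)
```

Because the all-requirements score fails a response on any single miss, it can never exceed the per-requirement score on the same data, which is one plausible reason gaps would appear larger on a full-instruction metric.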
