WildIFEval: Instruction Following in the Wild
Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor
TL;DR
WildIFEval introduces a large-scale, real-world benchmark for instruction following under multiple constraints. It aggregates 7,523 user-generated constrained tasks (24,731 constraints) from LMSYS-Chat-1M, with a top-down decomposition of constraints and a detailed taxonomy and diversity analysis. The paper benchmarks 14 instruction-tuned LLMs across sizes, revealing a sizable gap between small and large models and highlighting challenges posed by longer and form-related constraints, while showing strong alignment with existing benchmarks. By releasing the dataset and providing fine-grained analyses, the work aims to drive progress in constrained generation under realistic, varied conditions.
Abstract
Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have substantial room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing interesting patterns in how models follow constraints. We release our dataset to promote further research on instruction following under complex, realistic conditions.
