WildIFEval: Instruction Following in the Wild

Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor

TL;DR

WildIFEval introduces a large-scale, real-world benchmark for instruction following under multiple constraints. It aggregates 7,523 user-generated constrained tasks (24,731 constraints) from LMSYS-Chat-1M, with a top-down decomposition of constraints and a detailed taxonomy and diversity analysis. The paper benchmarks 14 instruction-tuned LLMs across sizes, revealing a sizable gap between small and large models and highlighting challenges posed by longer and form-related constraints, while showing strong alignment with existing benchmarks. By releasing the dataset and providing fine-grained analyses, the work aims to drive progress in constrained generation under realistic, varied conditions.

Abstract

Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have substantial room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing interesting patterns of model constraint-following behavior. We release our dataset to promote further research on instruction following under complex, realistic conditions.

Paper Structure

This paper contains 33 sections, 1 equation, 14 figures, 1 table.

Figures (14)

  • Figure 1: WildIFEval description. At the top is an example for a constrained generation task, and its decomposition into constraints. In evaluation (bottom), the judge decides whether each of the constraints is fulfilled.
  • Figure 2: Analysis of constraints in WildIFEval. (a) Distribution of constraint types. (b) A t-SNE projection of the embeddings of constraints, colored by their type. For convenience, we randomly subsample 1k data points. We observe some red, brown, and yellow clusters, corresponding to Format and Structure, Length, and Style and Tone constraints, aligning with the generic nature of these types. This is in contrast to content-oriented types like Focus/Emphasis and Include/Avoid (green and purple), which are more spread out.
  • Figure 3: Relative co-occurrence (PMI) of constraint categories within tasks. Values above $0$ indicate that constraints co-occur more than expected by their overall type frequencies.
  • Figure 4: Task and constraint characteristics in WildIFEval. (a) Domain distribution of tasks. (b) Lexical diversity of constraint phrasing (opening verbs).
  • Figure 5: Strict scores on WildIFEval. For each model, the figure reports the proportion of tasks in which all constraints were fulfilled (strict score). Soft scores are shown in a separate figure in the Appendix. Statistical significance between model pairs (McNemar tests) is also reported in the Appendix.
  • ...and 9 more figures
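
The captions for Figures 3 and 5 refer to two quantities that are easy to illustrate: the strict score (the proportion of tasks in which all constraints were fulfilled) and the PMI between constraint categories within tasks. The following is a minimal sketch, not the authors' implementation. It assumes that the soft score is the mean per-task fraction of fulfilled constraints (the captions do not define it), that PMI is computed over task-level presence of each category with a base-2 logarithm, and that judge decisions are available as booleans. The function names and the input layout are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def strict_and_soft_scores(judgments):
    """Compute (strict, soft) scores from per-task judge decisions.

    judgments: list of tasks, each a non-empty list of booleans,
    one per constraint (True = the judge marked it fulfilled).
    Strict: share of tasks with every constraint fulfilled.
    Soft (assumed definition): mean per-task fraction fulfilled.
    """
    if not judgments or any(len(task) == 0 for task in judgments):
        raise ValueError("need at least one task, each with at least one constraint")
    n_tasks = len(judgments)
    strict = sum(all(task) for task in judgments) / n_tasks
    soft = sum(sum(task) / len(task) for task in judgments) / n_tasks
    return strict, soft


def category_pmi(task_categories):
    """Pointwise mutual information between constraint categories.

    task_categories: list of sets, the categories present in each task.
    Returns {(cat_a, cat_b): pmi} for every pair that co-occurs at least
    once, with cat_a < cat_b alphabetically. Values above 0 mean the pair
    co-occurs more often than its marginal frequencies would predict.
    """
    n_tasks = len(task_categories)
    single = Counter()
    pair = Counter()
    for cats in task_categories:
        cats = set(cats)
        single.update(cats)
        pair.update(combinations(sorted(cats), 2))
    return {
        (a, b): math.log2((count / n_tasks) / ((single[a] / n_tasks) * (single[b] / n_tasks)))
        for (a, b), count in pair.items()
    }


# Toy data, invented for illustration only.
toy_judgments = [[True, True], [True, False, False], [True], [False, False]]
toy_categories = [
    {"Length", "Format and Structure"},
    {"Length", "Format and Structure"},
    {"Style and Tone"},
    {"Include/Avoid", "Style and Tone"},
]

strict, soft = strict_and_soft_scores(toy_judgments)
pmi = category_pmi(toy_categories)
print(f"strict={strict:.3f} soft={soft:.3f}")
print(pmi)
```

On the toy data, two of the four tasks have every constraint fulfilled, so the strict score is 0.5, while the soft score also credits the partially fulfilled task. The strict score penalizes any single miss, which is why it separates models more sharply on tasks with many constraints.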