AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios
Yunjia Qi, Hao Peng, Xiaozhi Wang, Amy Xin, Youfeng Liu, Bin Xu, Lei Hou, Juanzi Li
TL;DR
This work introduces AgentIF, the first benchmark dedicated to evaluating how well large language models follow long, complex instructions in realistic agentic scenarios. The authors collect 707 real-world agentic instructions from 50 tasks (averaging 1,723 words and about 11.9 constraints each) and annotate 8,415 constraints across formatting, semantic, and tool categories, each paired with code-based, LLM-based, or hybrid verification. Evaluation across diverse models shows that current LLMs struggle with agentic constraints, especially conditional and tool-based requirements, and that performance degrades as instruction length grows. The study also analyzes failure modes and meta constraints, offers insights for prompt design and model improvement, and releases the data and code to support future research.
Abstract
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, which introduces a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating the instruction-following ability of LLMs in agentic scenarios. AgentIF has three key characteristics: (1) Realistic, constructed from 50 real-world agentic applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words. (3) Complex, averaging 11.9 constraints per instruction and covering diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and their corresponding evaluation methods, which include code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically evaluate existing advanced LLMs and observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, yielding findings about the failure modes of existing LLMs. We have released the code and data to facilitate future research.
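To make the three evaluation modes named in the abstract more concrete, the following is a minimal sketch of how a per-constraint checker might dispatch between code-based, LLM-based, and hybrid code-LLM evaluation. It is not the authors' implementation: the names `Constraint`, `evaluate`, `satisfied_fraction`, and `toy_judge` are hypothetical, the hybrid mode is assumed to mean "an LLM extracts the relevant span, then code checks it", and the judge is a deterministic stub standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical judge interface: takes (question, response) and returns text.
Judge = Callable[[str, str], str]


@dataclass
class Constraint:
    description: str
    mode: str  # one of "code", "llm", "hybrid"
    code_check: Optional[Callable[[str], bool]] = None


def evaluate(constraint: Constraint, response: str,
             judge: Optional[Judge] = None) -> bool:
    """Return True if the response satisfies a single constraint."""
    if constraint.mode == "code":
        # Code-based: a deterministic check on the raw response.
        return constraint.code_check(response)
    if judge is None:
        raise ValueError("LLM-based and hybrid checks need a judge")
    if constraint.mode == "llm":
        # LLM-based: ask the judge for a YES/NO verdict.
        verdict = judge(
            f"Does the response satisfy: {constraint.description}? "
            "Answer YES or NO.",
            response,
        )
        return verdict.strip().upper().startswith("YES")
    if constraint.mode == "hybrid":
        # Hybrid (assumed design): the judge extracts the relevant span,
        # then a code check verifies that span.
        span = judge(
            f"Extract the part relevant to: {constraint.description}",
            response,
        )
        return constraint.code_check(span)
    raise ValueError(f"unknown mode: {constraint.mode}")


def satisfied_fraction(constraints: List[Constraint], response: str,
                       judge: Optional[Judge] = None) -> float:
    """Fraction of constraints the response satisfies (0.0 if none given)."""
    results = [evaluate(c, response, judge) for c in constraints]
    return sum(results) / len(results) if results else 0.0


def toy_judge(question: str, response: str) -> str:
    """Deterministic stand-in for an LLM judge, for illustration only."""
    if question.startswith("Extract"):
        return response.splitlines()[0] if response else ""
    return "YES" if "thank" in response.lower() else "NO"


demo_constraints = [
    Constraint("at most 20 words", "code",
               lambda r: len(r.split()) <= 20),
    Constraint("the response is polite", "llm"),
    Constraint("the title line has at most 5 words", "hybrid",
               lambda span: len(span.split()) <= 5),
]

good = "Weekly report summary\nThank you for waiting. All tasks are done."
bad = "a b c d e f\nDone."

print(satisfied_fraction(demo_constraints, good, toy_judge))
print(satisfied_fraction(demo_constraints, bad, toy_judge))
```

In practice `toy_judge` would be replaced by a call to an actual model, and the aggregate score reported by the benchmark may differ from the simple fraction used here.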
