AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios

Yunjia Qi, Hao Peng, Xiaozhi Wang, Amy Xin, Youfeng Liu, Bin Xu, Lei Hou, Juanzi Li

TL;DR

This work introduces AgentIF, the first benchmark dedicated to evaluating how well large language models follow long, complex instructions in realistic agentic scenarios. The authors collect 707 real-world agentic instructions from 50 tasks (averaging 1,723 words and about 11.9 constraints each) and annotate 8,415 constraints across formatting, semantic, and tool categories, each verified by code-based, LLM-based, or hybrid evaluation. Evaluation across diverse models shows that current LLMs struggle with agentic constraints, especially conditional and tool-based requirements, and that performance degrades as instruction length grows. The study also analyzes failure modes, including those on meta constraints, offers insights for prompt design and model improvement, and releases the data and code to support future research.

Abstract

Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios. AgentIF features three key characteristics: (1) Realistic, constructed from 50 real-world agentic applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words. (3) Complex, averaging 11.9 constraints per instruction, covering diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and corresponding evaluation metrics, including code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically evaluate existing advanced LLMs. We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, providing some findings about the failure modes of existing LLMs. We have released the code and data to facilitate future research.
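To make the evaluation idea in the abstract concrete, the following is a minimal sketch of code-based constraint checking and of aggregating the results into success rates. It is not the authors' released implementation: the `Constraint` class, the two example checkers, and the two aggregate rates (per constraint and per instruction) are illustrative assumptions. An LLM-based or hybrid check could be plugged in as any other callable that maps a response to a boolean.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Constraint:
    """Hypothetical container: a description plus a check on the response."""
    description: str
    check: Callable[[str], bool]


def is_valid_json(response: str) -> bool:
    """Code-based check: the response must parse as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def within_word_limit(limit: int) -> Callable[[str], bool]:
    """Code-based check factory: at most `limit` whitespace-separated words."""
    return lambda response: len(response.split()) <= limit


def evaluate(response: str, constraints: List[Constraint]) -> List[bool]:
    """Run every constraint check against one model response."""
    return [c.check(response) for c in constraints]


def success_rates(results: List[List[bool]]) -> Tuple[float, float]:
    """Return (fraction of constraints satisfied,
    fraction of instructions with all constraints satisfied)."""
    flat = [ok for per_instruction in results for ok in per_instruction]
    constraint_rate = sum(flat) / len(flat)
    instruction_rate = sum(all(r) for r in results) / len(results)
    return constraint_rate, instruction_rate


constraints = [
    Constraint("output must be valid JSON", is_valid_json),
    Constraint("at most 5 words", within_word_limit(5)),
]
results = [
    evaluate('{"a": 1}', constraints),     # satisfies both constraints
    evaluate("hello world", constraints),  # not JSON, but short enough
]
print(success_rates(results))  # (0.75, 0.5)
```

Reporting both rates separates partial compliance from full compliance: a response can satisfy most constraints while the instruction as a whole still counts as failed.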

Paper Structure

This paper contains 39 sections, 1 equation, 6 figures, 10 tables.

Figures (6)

  • Figure 2: An example instruction of AgentIF.
  • Figure 3: The data construction process and evaluation workflow of AgentIF. The detailed descriptions of different constraint types are presented in the constraint taxonomy section.
  • Figure 4: Error proportions (%) on condition and tool constraints. Figure (a) shows the errors in handling condition constraints, including condition check failure, where the model fails to recognize the condition, and constraint following failure. Figure (b) shows the errors from tool constraints, including disallowed tool usage (utilizing explicitly prohibited tools), omission of required tools (failing to employ required tools), tool name errors (invoking non-existent or incorrect tools), and parameter errors (applying incorrect or illegal arguments).
  • Figure 5: Success rates on instructions with varying length or constraint counts. Gray lines show results of the top 6 models in the main results table, and the colored lines present their average.
  • Figure 6: Figure (a) illustrates three types of meta constraints and examples. Most meta constraints fall within the Constraint Selection category, which requires models to follow one specific constraint. Figure (b) presents the success rates of different investigated models on each meta constraint type.
  • ...and 1 more figure
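
The four tool-constraint error categories named in the Figure 4 caption (disallowed tool usage, omission of required tools, tool name errors, and parameter errors) can be illustrated with a small classifier. This is a hedged sketch, not the paper's evaluation code: `ToolSpec`, `ToolCall`, `classify_tool_errors`, and the error labels are hypothetical names, and the parameter check only compares argument names against a declared set.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToolSpec:
    """Hypothetical tool specification: a name and its parameter names."""
    name: str
    required_params: Set[str]
    optional_params: Set[str] = field(default_factory=set)


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation made by the model."""
    name: str
    arguments: Dict[str, object]


def classify_tool_errors(
    calls: List[ToolCall],
    specs: Dict[str, ToolSpec],
    prohibited: Set[str],
    required: Set[str],
) -> List[str]:
    """Label each violation with one of the four categories."""
    errors: List[str] = []
    for call in calls:
        if call.name in prohibited:
            errors.append("disallowed_tool_usage")
        elif call.name not in specs:
            errors.append("tool_name_error")
        else:
            spec = specs[call.name]
            given = set(call.arguments)
            allowed = spec.required_params | spec.optional_params
            # Missing a required argument or passing an undeclared one.
            if not spec.required_params <= given or not given <= allowed:
                errors.append("parameter_error")
    called = {call.name for call in calls}
    # One omission error per required tool that was never called.
    for _ in sorted(required - called):
        errors.append("omission_of_required_tool")
    return errors


specs = {"search": ToolSpec("search", {"query"})}
calls = [
    ToolCall("serach", {"query": "x"}),  # misspelled tool name
    ToolCall("search", {"q": "x"}),      # wrong argument name
]
print(classify_tool_errors(calls, specs, prohibited={"shell"}, required={"search"}))
# ['tool_name_error', 'parameter_error']
```

Checking the prohibited set before the specification lookup means a banned tool is reported as disallowed usage even when it is otherwise well specified, which keeps the categories mutually exclusive per call.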