Benchmarking Complex Instruction-Following with Multiple Constraints Composition

Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, Yiming Liu, Jie Tang, Hongning Wang, Minlie Huang

TL;DR

ComplexBench introduces a hierarchical taxonomy for complex instructions (4 constraint types, 19 dimensions, 4 composition types) and a rule-augmented, dependency-aware evaluation to measure LLMs’ ability to follow multi-constraint instructions. It assembles a dataset of 1,150 instructions (5,306 scoring questions) and demonstrates that current LLMs struggle especially with Chain and Selection compositions, even when decomposing tasks. The framework combines LLM-based and rule-based verification to produce a DRFR score that reflects constraint satisfaction and composition dependencies. Overall, ComplexBench provides a principled, structure-aware benchmark that complements existing instruction-following evaluations and reveals gaps in modern LLMs’ handling of complex instructions.

Abstract

Instruction following is one of the fundamental capabilities of large language models (LLMs). As the ability of LLMs is constantly improving, they have been increasingly applied to deal with complex human instructions in real-world scenarios. Therefore, how to evaluate the ability of complex instruction-following of LLMs has become a critical research problem. Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints, which is an indispensable constituent in complex instructions. To this end, we propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints. We propose a hierarchical taxonomy for complex instructions, including 4 constraint types, 19 constraint dimensions, and 4 composition types, and manually collect a high-quality dataset accordingly. To make the evaluation reliable, we augment LLM-based evaluators with rules to effectively verify whether generated texts can satisfy each constraint and composition. Furthermore, we obtain the final evaluation score based on the dependency structure determined by different composition types. ComplexBench identifies significant deficiencies in existing LLMs when dealing with complex instructions with multiple constraints composition.
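The dependency-based scoring mentioned at the end of the abstract can be illustrated with a small sketch. The abstract does not give the exact aggregation rule, so the code below assumes one plausible reading: a scoring question counts as satisfied only if its own verdict is positive and every question it depends on is also satisfied, and the final score is the fraction of satisfied questions. The names `ScoringQuestion`, `depends_on`, and `aggregate` are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScoringQuestion:
    qid: str
    passed: bool  # raw yes/no verdict from a rule or an LLM evaluator
    depends_on: List[str] = field(default_factory=list)  # prerequisite question ids


def aggregate(questions: List[ScoringQuestion]) -> float:
    """Fraction of questions satisfied after propagating failed dependencies.

    Assumption (not confirmed by the source): a question whose prerequisite
    fails is counted as failed, and the dependency graph is acyclic.
    """
    if not questions:
        raise ValueError("at least one scoring question is required")

    by_id: Dict[str, ScoringQuestion] = {q.qid: q for q in questions}
    cache: Dict[str, bool] = {}

    def effective(qid: str) -> bool:
        # Memoized: a question holds only if it and all prerequisites hold.
        if qid not in cache:
            q = by_id[qid]
            cache[qid] = q.passed and all(effective(d) for d in q.depends_on)
        return cache[qid]

    return sum(effective(q.qid) for q in questions) / len(questions)


# A Chain-style example: q2 builds on q1, while q3 is independent.
example = [
    ScoringQuestion("q1", passed=False),
    ScoringQuestion("q2", passed=True, depends_on=["q1"]),
    ScoringQuestion("q3", passed=True),
]
score = aggregate(example)
```

In this example a plain average of the raw verdicts would be 2/3, whereas the dependency-aware score is 1/3, because the success of `q2` is discounted once its prerequisite `q1` fails.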

Paper Structure

This paper contains 40 sections, 4 equations, 7 figures, and 23 tables.

Figures (7)

  • Figure 1: An example instruction of ComplexBench. All constraint dimensions contained in the instruction are marked with underlines and colors, and they fall into three of the constraint types in our taxonomy: Format, Semantic, and Utility. Below is the composition structure of the instruction, where these constraint dimensions are combined through three composition types: And, Chain, and Selection.
  • Figure 2: Constraint distribution of ComplexBench. The Utility constraints helpfulness and factuality account for a high proportion because they are basic requirements for high-quality outputs and therefore appear in many instructions.
  • Figure 3: Composition types in ComplexBench. Each node is a part of an instruction. The purple node may contain other composition types, while the blue node does not. In addition to 4 basic types, the last row also shows a nested selection type.
  • Figure 4: Composition type distribution of general and professional instructions.
  • Figure 5: An example evaluation process of ComplexBench. Given an instruction and its scoring questions, ComplexBench combines rule-based and LLM-based evaluators to verify each question and aggregates the final score based on the dependency structure of the composition types in the instruction.
  • ...and 2 more figures