CFBench: A Comprehensive Constraints-Following Benchmark for LLMs

Tao Zhang, Chenglin Zhu, Yanjun Shen, Wenjing Luo, Yan Zhang, Hao Liang, Tao Zhang, Fan Yang, Mingan Lin, Yujing Qiao, Weipeng Chen, Bin Cui, Wentao Zhang, Zenan Zhou

TL;DR

CFBench is a large-scale Chinese constraints-following benchmark of 1,000 manually annotated samples spanning more than 200 real-world scenarios and over 50 NLP tasks. It introduces a constraint taxonomy with 10 primary categories and over 25 subcategories, together with a multi-dimensional evaluation framework built on three metrics, the Constraint Satisfaction Rate (CSR), Instruction Satisfaction Rate (ISR), and Priority Satisfaction Rate (PSR), which reflect user-centric constraints and priorities. Empirical results show that current LLMs struggle with diverse constraints: GPT-4o leads overall, but no model dominates across all categories, which leaves significant room for improvement. The work also describes its data-quality process and iterative construction, and discusses practical implications for designing evaluations that better match real user needs and complex instruction scenarios.

Abstract

The adeptness of Large Language Models (LLMs) in comprehending and following natural language instructions is critical for their deployment in sophisticated real-world applications. Existing evaluations mainly focus on fragmented constraints or narrow scenarios, but they overlook the comprehensiveness and authenticity of constraints from the user's perspective. To bridge this gap, we propose CFBench, a large-scale Comprehensive Constraints Following Benchmark for LLMs, featuring 1,000 curated samples that cover more than 200 real-life scenarios and over 50 NLP tasks. CFBench meticulously compiles constraints from real-world instructions and constructs an innovative systematic framework for constraint types, which includes 10 primary categories and over 25 subcategories, and ensures each constraint is seamlessly integrated within the instructions. To make certain that the evaluation of LLM outputs aligns with user perceptions, we propose an advanced methodology that integrates multi-dimensional assessment criteria with requirement prioritization, covering various perspectives of constraints, instructions, and requirement fulfillment. Evaluating current leading LLMs on CFBench reveals substantial room for improvement in constraints following, and we further investigate influencing factors and enhancement strategies. The data and code are publicly available at https://github.com/PKU-Baichuan-MLSystemLab/CFBench
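
To make the three evaluation perspectives (constraints, instructions, and requirement fulfillment) more concrete, the following is a minimal sketch of how per-sample checklist results could be aggregated into CSR, ISR, and PSR. It is not the authors' implementation. The `Check` class, its `primary` flag, and the rule that PSR counts a sample only when every high-priority constraint is met are assumptions made for illustration; the paper's exact definitions may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Check:
    """One checklist item for a sample (hypothetical structure)."""
    satisfied: bool  # did the model output meet this constraint?
    primary: bool    # assumed flag: is this a high-priority requirement?


Sample = List[Check]


def csr(samples: List[Sample]) -> float:
    """Constraint-level score: mean fraction of satisfied constraints per sample."""
    per_sample = [sum(c.satisfied for c in s) / len(s) for s in samples]
    return sum(per_sample) / len(samples)


def isr(samples: List[Sample]) -> float:
    """Instruction-level score: fraction of samples with every constraint satisfied."""
    return sum(all(c.satisfied for c in s) for s in samples) / len(samples)


def psr(samples: List[Sample]) -> float:
    """Priority-level score (assumed rule): fraction of samples whose
    high-priority constraints are all satisfied."""
    return sum(
        all(c.satisfied for c in s if c.primary) for s in samples
    ) / len(samples)


# Two toy samples: the first misses a low-priority constraint.
toy = [
    [Check(satisfied=True, primary=True), Check(satisfied=False, primary=False)],
    [Check(satisfied=True, primary=True), Check(satisfied=True, primary=False)],
]
print(csr(toy), isr(toy), psr(toy))
```

Under these assumptions the three scores separate naturally: the first toy sample lowers CSR and ISR because one constraint fails, but it does not lower PSR because the failed constraint is not flagged as high priority.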

Paper Structure

This paper contains 48 sections, 3 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Sample data from CFBench. A checklist, constraint type, requirement priority, and satisfaction constitute our evaluation criteria.
  • Figure 2: The construction pipeline and an evaluation sample of CFBench. The pipeline first constructs the constraint system, then assembles the dataset, and finally proposes a multi-perspective, user-centric evaluation.
  • Figure 3: The distribution of NLP tasks and domains.
  • Figure 4: Different mainstream models' results under primary and secondary constraint categories.
  • Figure 5: Different mainstream models' PSR results in real-world domains and NLP task types.
  • ...and 6 more figures