Can Large Language Models Understand Real-World Complex Instructions?

Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Lida Chen, Xintao Wang, Yuncheng Huang, Haoning Ye, Zihan Li, Shisong Chen, Yikai Zhang, Zhouhong Gu, Jiaqing Liang, Yanghua Xiao

TL;DR

The paper introduces CELLO, a benchmark designed to evaluate how well large language models follow complex real-world instructions, which combine multi-task constraints and heterogeneous inputs. It defines eight instruction features, uses a two-stage data construction process from real-world prompts and logs, and pairs the dataset with four evaluation criteria and automatic metrics to capture open-ended instruction understanding. Extensive experiments across 34 models—spanning Chinese- and English-oriented families—reveal that instruction-following ability depends on factors like instruction tuning quality, language orientation, and context handling, with notable gaps relative to large closed models. The work provides a practical, discriminative tool for evaluating and guiding the development of LLMs toward robust real-world instruction following, and the data and resources are publicly available.

Abstract

Large language models (LLMs) can understand human instructions, showing their potential for pragmatic applications beyond traditional NLP tasks. However, they still struggle with complex instructions, which can be either complex task descriptions that require multiple tasks and constraints, or complex inputs that contain long context, noise, heterogeneous information, and multi-turn formats. Due to these features, LLMs often ignore semantic constraints from task descriptions, generate incorrect formats, violate length or sample count constraints, and are unfaithful to the input text. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions, as they are closed-ended and simple. To bridge this gap, we propose CELLO, a benchmark for systematically evaluating LLMs' ability to follow complex instructions. We design eight features for complex instructions and construct a comprehensive evaluation dataset from real-world scenarios. We also establish four criteria and develop corresponding metrics, as current ones are inadequate, biased, or too strict and coarse-grained. We compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions through extensive experiments. Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO.

Paper Structure (24 sections, 4 equations, 4 figures, 11 tables)


Figures (4)

  • Figure 1: Existing benchmarks generally contain simple and common instructions. However, the complex instructions in real-world scenarios are a composition of multiple features, such as constraints on the output format, number of output samples, key elements of the output, and heterogeneity of input texts in the given example. The understanding of complex instructions poses challenges to current models.
  • Figure 2: The framework of our benchmark design. We first establish a framework containing eight features for complex instructions, then construct an evaluation dataset covering nine tasks, and finally propose four evaluation criteria along with their corresponding metrics.
  • Figure 3: The performance of models on mainstream benchmarks.
  • Figure 4: The performance of LLMs built on the same base model, LLaMA (Touvron et al., 2023), across different tasks and criteria.