ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions

Xingwei He, Qianru Zhang, Pengfei Chen, Guanhua Chen, Linlin Yu, Yuan Yuan, Siu-Ming Yiu

TL;DR

ConInstruct presents a novel benchmark to evaluate large language models on detecting and resolving conflicting constraints within user instructions. It constructs a diverse dataset with six constraint types across six tasks, introducing nine conflict types and multiple conflicts per instruction to study detection and resolution behaviors systematically. The study finds proprietary LLMs excel at detecting conflicts but often fail to explicitly communicate them, while many open-source models lag; prompt engineering can guide resolution behavior but does not fully solve the transparency issue. These results reveal a critical gap in instruction-following LLMs and point to designing models that better notify users of conflicts and solicit clarifications to improve reliability in real-world prompts.

Abstract

Instruction-following is a critical capability of Large Language Models (LLMs). While existing work primarily focuses on assessing how well LLMs adhere to user instructions, it often overlooks scenarios where instructions contain conflicting constraints, a common occurrence in complex prompts. The behavior of LLMs under such conditions remains under-explored. To bridge this gap, we introduce ConInstruct, a benchmark specifically designed to assess LLMs' ability to detect and resolve conflicts within user instructions. Using this dataset, we evaluate LLMs' conflict detection performance and analyze their conflict resolution behavior. Our experiments reveal two key findings: (1) Most proprietary LLMs exhibit strong conflict detection capabilities, whereas among open-source models, only DeepSeek-R1 demonstrates similarly strong performance. DeepSeek-R1 and Claude-4.5-Sonnet achieve the highest average F1-scores at 91.5% and 87.3%, respectively, ranking first and second overall. (2) Despite their strong conflict detection abilities, LLMs rarely notify users explicitly about the conflicts or request clarification when faced with conflicting constraints. These results underscore a critical shortcoming in current LLMs and highlight an important area for future improvement when designing instruction-following LLMs.
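
The F1-scores quoted above combine precision and recall. As a rough illustration only, the sketch below computes these metrics under the assumption that conflict detection is scored as a binary decision per instruction (conflict present or not); the function name `detection_scores` and the toy labels are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ.

```python
from typing import Dict, Sequence


def detection_scores(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Precision, recall, and F1 for binary conflict detection.

    gold[i] is True if instruction i really contains a conflict;
    pred[i] is True if the model reported one. (Assumed setup.)
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example with made-up labels (not benchmark data).
gold_labels = [True, True, True, False, False]
model_preds = [True, True, False, True, False]
scores = detection_scores(gold_labels, model_preds)
print(scores)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.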

Paper Structure

This paper contains 83 sections, 2 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: An instruction with conflicts from ConInstruct, where green and red text indicate conflicts between phrase constraints and length constraints, respectively. The lower part of the figure presents two responses to the instruction, from GPT-4o and Claude-3.5-Sonnet.
  • Figure 2: The construction process of the ConInstruct Benchmark: We start with a seed instruction, then add constraints to it. Finally, we introduce conflicts into the expanded instructions. Due to space limits, we show only four conflicts. In each conflict, the first constraint is newly added, while the second comes from the original instruction.
  • Figure 3: Conflict detection results of LLMs for instructions with varying numbers of conflicts (i.e., instructions in $\mathcal{I}_k$). The x-axis denotes the number of conflicts per instruction. The reported metric is Recall.
  • Figure 4: Distributions of conflict resolution behaviors exhibited by different LLMs when responding to instructions with varying numbers of conflicts. The x-axis denotes the number of conflicts per instruction.
  • Figure 5: CSR results of various LLMs across different constraint types.
  • ...and 2 more figures