Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models

Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu

TL;DR

The paper addresses the tendency of large language models to accept flawed premises uncritically. It introduces PCBench, a benchmark that injects four error types across three difficulty levels and three problem variants to assess the Premise Critique Ability of 15 LLMs. Key findings: models show limited autonomous critique, their performance is sensitive to error type and difficulty, reasoning ability does not consistently predict critique ability, and reasoning models overthink markedly when premises are flawed. The work argues that proactive premise critique should be developed as a foundational capability for reliable, human-centric AI systems.

Abstract

Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs. Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to detect than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs' proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems. The code is available at https://github.com/MLGroupJLU/Premise_Critique.

Paper Structure

This paper contains 37 sections, 5 equations, 17 figures, 18 tables.

Figures (17)

  • Figure 1: Illustration of how LLMs handle a query containing contradictory premises about book percentages. The example presents conflicting statements regarding the proportion of German books and contrasts two model behaviors: one that passively accepts the flawed premises, and another that actively identifies and reports the inconsistency. This highlights the importance of Premise Critique Ability, which refers to the capacity to detect and articulate flaws in the input premises.
  • Figure 2: An overview of the dataset construction and the evaluation pipeline.
  • Figure 3: Proactive premise critique rates for the four error categories.
  • Figure 4: Proactive premise critique rates at the three difficulty levels.
  • Figure 5: An illustrative case of DeepSeek-R1's response to a Contradictory Inference Insertion question. The red text marks the contradictory segment in the question. The blue text shows that the model successfully identifies the contradiction through iterative reasoning. The orange text indicates that the model makes autonomous decisions without user guidance. In its final answer, the model relies on its own assumptions without offering critical feedback, revealing a lack of Premise Critique Ability.
  • ...and 12 more figures