Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Zhe Yang, Yichang Zhang, Tianyu Liu, Jian Yang, Junyang Lin, Chang Zhou, Zhifang Sui

TL;DR

The paper investigates why large language models can solve hard problems yet fail on easier ones, a phenomenon termed hard-to-easy inconsistency. It introduces ConsisEval, a benchmark that pairs easy and hard questions across the Mathematics, Code, and Instruction-following domains, and defines the Consistency Score ($CS$) and Relative Consistency Score ($RCS$) to quantify this behavior; it also describes two probability-estimation methods for computing these metrics. In experiments on both closed- and open-source LLMs, GPT-4 achieves the highest $CS$ (92.2%), yet even this model remains inconsistent on specific questions, for example when distracted by redundant information or when it misreads the question. Stronger models tend to be more consistent, though not universally, and hard training data and hard in-context demonstrations generally improve consistency. The authors state that the data and code will be released on GitHub, which would let others reproduce the evaluation and study training and evaluation strategies for mitigating hard-to-easy inconsistency.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues (e.g., LLMs can react differently to disturbances like rephrasing or inconsequential order change). In addition to these inconsistencies, we also observe that LLMs, while capable of solving hard problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy inconsistency, we develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty. Furthermore, we introduce the concept of consistency score to quantitatively measure this inconsistency, and we use the relative consistency score to analyze the potential for improvement in consistency. Based on comprehensive experiments across a variety of existing models, we find: (1) GPT-4 achieves the highest consistency score of 92.2% but is still inconsistent on specific questions due to distraction by redundant information, misinterpretation of questions, etc.; (2) models with stronger capabilities typically exhibit higher consistency, but exceptions also exist; (3) hard data enhances consistency for both fine-tuning and in-context learning. Our data and code will be publicly available on GitHub.
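
The consistency score is not defined on this page, so the following is only a minimal sketch under stated assumptions: it treats the score as the conditional probability that a model answers the easy question correctly given that it answers the paired hard question correctly, which is one reading of the Venn diagram described in Figure 3. It further assumes that the two answers within a pair are sampled independently and that each probability is estimated by the fraction of correct samples. The function name `estimate_consistency_score` and the input layout are hypothetical and are not taken from the authors' code.

```python
from typing import Sequence, Tuple

# One benchmark entry: sampled correctness outcomes for the easy question
# and for the paired hard question (True means the sample was correct).
PairOutcomes = Tuple[Sequence[bool], Sequence[bool]]


def estimate_consistency_score(pairs: Sequence[PairOutcomes]) -> float:
    """Estimate P(easy correct | hard correct) over a set of question pairs.

    Assumptions (not confirmed by the text above):
      * the score is a conditional probability aggregated over all pairs;
      * easy and hard answers within a pair are sampled independently,
        so P(both correct) is approximated by p_easy * p_hard.
    """
    joint_mass = 0.0  # running sum of estimated P(easy and hard both correct)
    hard_mass = 0.0   # running sum of estimated P(hard correct)
    for easy_outcomes, hard_outcomes in pairs:
        if not easy_outcomes or not hard_outcomes:
            raise ValueError("each question needs at least one sampled answer")
        p_easy = sum(easy_outcomes) / len(easy_outcomes)
        p_hard = sum(hard_outcomes) / len(hard_outcomes)
        joint_mass += p_easy * p_hard
        hard_mass += p_hard
    if hard_mass == 0:
        raise ValueError("the model never solved a hard question; score is undefined")
    return joint_mass / hard_mass


if __name__ == "__main__":
    toy_pairs = [
        ([True, True], [True, True]),    # solves both questions every time
        ([False, False], [True, True]),  # solves the hard one, fails the easy one
    ]
    print(estimate_consistency_score(toy_pairs))
```

In the toy input, the second pair is a pure hard-to-easy inconsistency, so it adds to the denominator but not to the numerator and pulls the estimate down.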

Paper Structure (56 sections, 23 equations, 12 figures, 10 tables, 1 algorithm)

Figures (12)

  • Figure 1: A hard-to-easy inconsistency case of LLMs. A counter-intuitive phenomenon occurs when an LLM, which can solve a harder problem, surprisingly goes wrong on an easier problem.
  • Figure 2: The hard data collection process of ConsisEval. An easy datum is fed into GPT-4 with a well-designed prompt and multiple hard data candidates are sampled. Human annotators select the one of best quality, then check and revise the sample to make it fit our criteria.
  • Figure 3: Venn diagram for consistent/inconsistent models in the complete probability space. The orange circle, the red circle, and their overlap denote the probability of a model correctly answering easy questions, hard questions, and both, respectively. The overlap area of consistent models is much larger than that of inconsistent models.
  • Figure 4: Visualized expression of relative consistency score.
  • Figure 5: Relative consistency results in the code domain (shown in ascending order of CS). In addition to showing the RCS of each evaluated model as a bar, we also show CS and its upper and lower bounds as lines of different colors for comparison.
  • ...and 7 more figures