WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking

Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Chen Huang, Kejia Huang, Yao Wan, Zhichao Hou, Xuming Hu

TL;DR

WakenLLM investigates the Vague Perception (VP) phenomenon, in which LLMs output Unknown either on verifiable reasoning tasks they fail to solve or on unverifiable ones they cannot justify. It introduces a fine-grained benchmark and a two-stage Remind-then-Guide stimulation framework to quantify and awaken latent reasoning, reporting up to a 68.53% accuracy improvement on VP samples without retraining. Across six LLMs and four benchmarks, the approach reveals substantial latent reasoning capacity that current baselines fail to activate, suggesting higher theoretical upper bounds on reasoning accuracy. The framework provides a principled way to assess and harness an LLM's potential, with implications for trustworthiness, evaluation protocols, and robust reasoning in AI systems.

Abstract

Large Language Models (LLMs) frequently output the label Unknown in reasoning tasks, which can reflect two scenarios: (i) the input sample is genuinely unverifiable, but the model cannot understand why; and (ii) the problem is verifiable, but the model fails to solve it and therefore outputs Unknown. We refer to these cases collectively as the Vague Perception phenomenon. Current evaluations focus on whether such answers are honest, rather than analyzing the limits of LLM reasoning. To address this, we introduce WakenLLM, a framework that quantifies the portion of Unknown outputs attributable to model incapacity and evaluates whether stimulation can convert them into either correct answers (for verifiable samples) or justified responses with valid reasoning (for unverifiable samples). Our method offers a clearer picture of the limits of LLM reasoning and the potential for corrections across various datasets. Comprehensive experiments on six LLMs suggest that, without any training or parameter revision, LLMs can achieve up to a 68.53% accuracy improvement on Vague Perception samples through guided understanding. Our work reveals that current baseline methods activate only a small portion of LLMs' reasoning potential, indicating considerable unexplored capacity. This extends the theoretical upper bounds of reasoning accuracy in LLMs. Consequently, this study deepens our understanding of the latent reasoning capacity of LLMs and offers a new perspective on addressing the Vague Perception phenomenon.
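As a rough illustration of the two scenarios above, the following Python sketch flags Vague Perception samples and computes how many of them a later stimulation step converts. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Sample` record, the `reason_ok` flag (whether the model justified an Unknown label correctly), and both helper functions are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    gold: str                # ground-truth label: "True", "False", or "Unknown"
    pred: str                # label the model produced on the first pass
    reason_ok: bool = False  # hypothetical flag: model justified an Unknown correctly


def is_vague_perception(s: Sample) -> bool:
    """Return True if the sample matches one of the two scenarios in the abstract."""
    if s.pred != "Unknown":
        return False
    if s.gold in ("True", "False"):
        # Scenario (ii): verifiable sample that the model answered with Unknown.
        return True
    # Scenario (i): genuinely unverifiable, but the model cannot explain why.
    return not s.reason_ok


def vp_rate(samples: list[Sample]) -> float:
    """Fraction of all samples that exhibit Vague Perception."""
    return sum(is_vague_perception(s) for s in samples) / len(samples)


def conversion_rate(converted: list[bool]) -> float:
    """Fraction of VP samples fixed after stimulation (one flag per VP sample)."""
    return sum(converted) / len(converted) if converted else 0.0


samples = [
    Sample(gold="True", pred="Unknown"),                     # VP, scenario (ii)
    Sample(gold="Unknown", pred="Unknown"),                  # VP, scenario (i)
    Sample(gold="Unknown", pred="Unknown", reason_ok=True),  # justified Unknown
    Sample(gold="False", pred="False"),                      # ordinary correct answer
]
vp = [s for s in samples if is_vague_perception(s)]
```

In this toy data, two of the four samples are flagged as Vague Perception. How the conversion flags are obtained (the stimulation prompts themselves) is left out, since the text here does not specify them.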

Paper Structure

This paper contains 53 sections, 14 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Definition of Vague Perception: an LLM misjudges a verifiable sample as Unknown, or, when facing an objectively unsolvable sample, cannot tell why it should be categorized as Unknown. The figure presents two representative causes of this phenomenon, Fact Understanding and Reasoning Gap, highlighting (1) limited ability to solve a problem and (2) misunderstanding of facts versus failure to link them.
  • Figure 2: Overview of the proposed pipeline. We begin with a balanced Fact and Story-based dataset, sampling verifiable (T/F) and unverifiable (Unknown) samples at a 1:1 ratio. The pipeline then (1) detects cases that expose the model’s Vague Perception during reasoning and (2) applies a two-stage remedy: Stage 1 uses stimulation to re-evaluate the mislabeled samples, while Stage 2 gathers the remaining False Converting samples from Stage 1 for model self-reflection. Stimulation types vary with data form.
  • Figure 3: Overall benchmark distribution, which contains 2,840 samples grouped into two forms: story-based (56.3%) and fact-based (43.7%). A detailed breakdown of all topics and the "Else" category is provided in the Appendix (section "Topics").
  • Figure 4: Percentage of samples on which GPT-4o (left) and LLaMA-3.1-8B (right) exhibit Vague Perception across different benchmarks. "T/F$\rightarrow$Unknown" indicates samples that are verifiable in principle but that the model incorrectly predicted as "Unknown". "Unknown (Failed)" denotes cases where the model outputs "Unknown" because it lacks the capability to correctly assess verifiability.
  • Figure 5: Distribution of the root causes behind GPT-4o’s Vague Perception errors across all benchmarks (left). Comparison of the False and Unexcited Converting Rates of the Qwen-2.5-7B and 72B models (right). "FU": Fact Understanding; "RG": Reasoning Gap; "Else" aggregates residual minor factors.
  • ...and 1 more figure