DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs

Masoud Hashemi, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudhan, Jishnu Sethumadhavan Nair, Aman Tiwari, Vikas Yadav

TL;DR

DNR Bench presents a targeted benchmark to expose over-reasoning in reasoning LLMs using 150 adversarial prompts across five failure modes. It demonstrates that many reasoning-trained models generate far more tokens than necessary and struggle on tasks where non-reasoning baselines perform well, even with explicit instructions. The study introduces token-efficiency metrics and human validation, showing that instruction guidance can improve but not fully fix over-reasoning, and calls for adaptive inference strategies to balance accuracy and efficiency. Overall, the work highlights critical weaknesses in current RLMs and provides a framework for evaluating and mitigating excessive reasoning in real-world deployments.

Abstract

Test-time scaling has significantly improved large language model performance, enabling deeper reasoning to solve complex problems. However, this increased reasoning capability also leads to excessive token generation and unnecessary problem-solving attempts. We introduce Don't Reason Bench (DNR Bench), a new benchmark designed to evaluate LLMs' ability to robustly recognize tricky reasoning triggers and avoid unnecessary generation. DNR Bench consists of 150 adversarially designed prompts that are easy for humans to understand and respond to, but surprisingly difficult for many recent prominent LLMs. DNR Bench tests models' abilities across different capabilities, such as instruction adherence, hallucination avoidance, redundancy filtering, and unanswerable question recognition. We evaluate reasoning LLMs (RLMs), including DeepSeek-R1, OpenAI O3-mini, and Claude-3.7-sonnet, and compare them against a powerful non-reasoning model, e.g., GPT-4o. Our experiments reveal that RLMs generate up to 70x more tokens than necessary, often failing at tasks that simpler non-reasoning models handle efficiently and with higher accuracy. Our findings underscore the need for more effective training and inference strategies in RLMs.

Paper Structure

This paper contains 18 sections, 9 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Changes in model accuracy across different instructions. DS: DeepSeek, L: Llama 3.1, Q: Qwen 2.5.
  • Figure 2: Accuracy across different data categories and instructions.
  • Figure 3: Changes in token count across different instructions. DS: DeepSeek, L: Llama 3.1, Q: Qwen 2.5.
  • Figure 4: Mean token count across different data categories and instructions.
  • Figure 5: Average token inefficiency $I_{token}$ (defined in the paper's metric equation) for different data categories, averaged across all models.
  • ...and 12 more figures