THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models

Zhiyuan Li, Yi Chang, Yuan Wu

TL;DR

This paper introduces Think-Bench, a benchmark designed to evaluate the reasoning efficiency of large reasoning models (LRMs), and reveals that most LRMs overthink easy questions, generating unnecessarily lengthy reasoning chains.

Abstract

Large reasoning models (LRMs) have achieved impressive performance in complex tasks, often outperforming conventional large language models (LLMs). However, the prevalent issue of overthinking severely limits their computational efficiency. Overthinking occurs when models generate excessive and redundant tokens that contribute little to accurate outcomes, especially in simple tasks, resulting in a significant waste of computational resources. To systematically investigate this issue, we introduce Think-Bench, a benchmark designed to evaluate the reasoning efficiency of LRMs. We also propose novel efficiency metrics and conduct a comprehensive evaluation of various LRMs across multiple dimensions, including the reasoning process, outcome quality, and chain-of-thought (CoT) characteristics. Our analysis reveals that most LRMs exhibit overthinking in handling easy questions, generating unnecessarily lengthy reasoning chains. While many LRMs demonstrate high CoT quality, several suffer from low efficiency. We hope that Think-Bench can serve as a robust foundation for advancing research into LRMs.

Paper Structure

This paper contains 24 sections, 6 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: The performance of various LRMs on Think-Bench. The results suggest that these prominent LRMs face a challenge of overthinking.
  • Figure 2: Overview of Think-Bench. Our benchmark contains a comprehensive efficiency evaluation framework with curated datasets across three categories.
  • Figure 3: Category and Subcategory Distribution of Think-Bench.
  • Figure 4: Illustration of Thinking Efficiency and CoT Quality Evaluation.
  • Figure 5: Example of Thinking Process Analysis in an LRM.
  • ...and 3 more figures