NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes

Lizhou Fan, Wenyue Hua, Lingyao Li, Haoyang Ling, Yongfeng Zhang

TL;DR

NPHardEval proposes a benchmark grounded in complexity classes to rigorously assess LLMs' reasoning ability up to the $NP$-hard level across $P$, $NP$-complete, and $NP$-hard tasks. It combines 900 algorithmic questions with automated generation and verification, plus a monthly update cycle that mitigates overfitting and keeps the questions fresh. The study analyzes 12 models (10 in the abstracts) across zero-shot, robustness, and in-context learning settings, revealing systematic performance declines as complexity increases and notable gaps between closed- and open-source models. The framework aims to provide a principled, dynamic, and scalable means to quantify and compare algorithmic reasoning in LLMs, with potential impact on benchmark design and on model evaluation in real-world decision-making contexts.

Abstract

Complex reasoning ability is one of the most important features of current LLMs, which has also been leveraged to play an integral role in complex decision-making tasks. Therefore, the investigation into the reasoning capabilities of Large Language Models (LLMs) is critical: numerous benchmarks have been established to assess the reasoning abilities of LLMs. However, current benchmarks are inadequate in offering a rigorous evaluation of the full extent of reasoning abilities that LLMs are capable of achieving. They are also prone to the risk of overfitting, as these benchmarks, being publicly accessible and static, allow models to potentially tailor their responses to specific benchmark metrics, thereby inflating their performance. Addressing these limitations, our research introduces a new benchmark, named NPHardEval. This benchmark is designed to evaluate the reasoning abilities of LLMs across a broad spectrum of 900 algorithmic questions, extending up to the NP-hard complexity class. These questions are meticulously chosen to represent a wide range of complexity classes up to and including NP-hard, offering a rigorous measure of the reasoning ability of LLMs. Through this study, we shed light on the current state of reasoning in LLMs, providing an objective and rigorous perspective through the comparison of LLMs' performance across complexity classes. Moreover, this benchmark is designed with a dynamic update mechanism, where the datapoints are refreshed on a monthly basis. Such regular updates play a crucial role in mitigating the risk of LLMs overfitting to the benchmark, promoting a more accurate and reliable assessment of their reasoning capabilities. The benchmark dataset and code of NPHardEval are available at https://github.com/casmlab/NPHardEval.
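
As a rough illustration of what automated generation, automated verification, and a periodic refresh could look like together, the following Python sketch builds a small graph coloring instance from a month label and checks a candidate answer. It is not the authors' implementation: the choice of graph coloring as the example problem, the function names `generate_instance` and `verify_coloring`, the month-label seeding scheme, and all parameter defaults are assumptions made for this sketch.

```python
import random
from itertools import combinations


def generate_instance(month, num_nodes=6, edge_prob=0.4):
    """Build a random undirected graph seeded by a month label (hypothetical scheme).

    The same label always yields the same instance, so a new label
    stands in for a refreshed batch of questions.
    """
    rng = random.Random(month)  # string seeds are deterministic across runs
    edges = [
        (u, v)
        for u, v in combinations(range(num_nodes), 2)
        if rng.random() < edge_prob
    ]
    return {"num_nodes": num_nodes, "edges": edges}


def verify_coloring(instance, coloring, num_colors):
    """Check that `coloring` assigns one of `num_colors` colors to every node
    and that no edge joins two nodes of the same color."""
    if len(coloring) != instance["num_nodes"]:
        return False
    if any(not (0 <= c < num_colors) for c in coloring):
        return False
    return all(coloring[u] != coloring[v] for u, v in instance["edges"])


if __name__ == "__main__":
    instance = generate_instance("2024-01")
    # Giving every node its own color is always a valid (if wasteful) answer.
    trivial_answer = list(range(instance["num_nodes"]))
    print(verify_coloring(instance, trivial_answer, instance["num_nodes"]))
```

The point of the sketch is that checking an answer is cheap even when finding one is hard, which is what makes automatic grading of freshly generated questions feasible.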

Paper Structure

This paper contains 58 sections, 2 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Computational complexity classes P, NP-complete, and NP-hard and corresponding tasks
  • Figure 2: Zero-shot model performance on the nine tasks from P to NP-Complete bottom-up.
  • Figure 3: Model performance on different complexity problems: (a) weighted accuracy, (b) (weighted) failure rate. Open models are denoted by squares and closed models by triangles. Metric trends are shown for models with outstanding performance in both weighted accuracy and failure rate, including both closed-source (GPT 4 Turbo and Claude 2) and open-source (Mistral-7B and Phi-2) models.
  • Figure 4: Models' performance on each complexity level. (a) GPT 4 Turbo. (b) Claude 2. (c) GPT 3.5 Turbo. (d) Claude Instant 1.2. (e) PaLM 2. (f) Yi-34b. (g) Qwen-14b. (h) Mistral-7b. (i) Phi-2. (j) MPT-30b. (k) Vicuna-13b. (l) Phi-1.5.
  • Figure 5: Models' performance on tasks across complexity levels. (a) GPT 4 Turbo. (b) Claude 2. (c) GPT 3.5 Turbo. (d) Claude Instant 1.2. (e) PaLM 2. (f) Yi-34b. (g) Qwen-14b. (h) Mistral-7b. (i) Phi-2. (j) MPT-30b. (k) Vicuna-13b. (l) Phi-1.5.
  • ...and 2 more figures