DeepQuestion: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance

Ali Khoramfar, Ali Ramezani, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi, Heshaam Faili

TL;DR

DeepQuestion is a scalable, automated framework that systematically elevates the cognitive complexity of existing datasets. Its results indicate that current benchmarks overestimate true reasoning abilities and highlight the critical need for cognitively diverse evaluations to guide future LLM development.

Abstract

While Large Language Models (LLMs) achieve near-human performance on standard benchmarks, their capabilities often fail to generalize to complex, real-world problems. To bridge this gap, we introduce DeepQuestion, a scalable, automated framework that systematically elevates the cognitive complexity of existing datasets. Grounded in Bloom's taxonomy, DeepQuestion generates (1) scenario-based problems to test the application of knowledge in noisy, realistic contexts, and (2) instruction-based prompts that require models to create new questions from a given solution path, assessing synthesis and evaluation skills. Our extensive evaluation across ten leading open-source and proprietary models reveals a stark performance decline, with accuracy dropping by up to 70% as tasks ascend the cognitive hierarchy. These findings underscore that current benchmarks overestimate true reasoning abilities and highlight the critical need for cognitively diverse evaluations to guide future LLM development.

Paper Structure

This paper contains 16 sections, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: Bloom's taxonomy hierarchy
  • Figure 2: Overview of the DeepQuestion framework. It begins by selecting a random batch of questions and answers. A conversation between the prompt-generator and prompt-evaluator LLMs then produces the deep-question prompt. Using this prompt, the question-generator LLM converts each question-answer pair into a deep question and answer.
  • Figure 3: Examples of question transformations produced by the Q2S and Q2I pipelines
  • Figure 4: Evaluation of different LLMs on original and scenario-based questions
  • Figure 5: Evaluation of different LLMs on original and instruction-based questions
  • ...and 3 more figures