MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs

Saeid Asgari Taghanaki, Aliasgahr Khani, Amir Khasahmadi

TL;DR

This work introduces MMLU-Pro+, an enhanced benchmark that builds on MMLU-Pro to assess shortcut learning and higher-order reasoning in LLMs. It also proposes new metrics, such as the shortcut selection ratio and the correct pair identification ratio, that offer deeper insight into model behavior and anchoring bias.

Abstract

Existing benchmarks for large language models (LLMs) increasingly struggle to differentiate between top-performing models, underscoring the need for more challenging evaluation frameworks. We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut learning and higher-order reasoning in LLMs. By incorporating questions with multiple correct answers across diverse domains, MMLU-Pro+ tests LLMs' ability to engage in complex reasoning and resist simplistic problem-solving strategies. Our results show that MMLU-Pro+ maintains MMLU-Pro's difficulty while providing a more rigorous test of model discrimination, particularly in multi-correct answer scenarios. We introduce novel metrics like shortcut selection ratio and correct pair identification ratio, offering deeper insights into model behavior and anchoring bias. Evaluations of six state-of-the-art LLMs reveal significant performance gaps, highlighting variations in reasoning abilities and bias susceptibility. We release the dataset and evaluation code at https://github.com/asgsaeid/mmlu-pro-plus.

Paper Structure (10 sections, 4 equations, 4 figures, 1 table)

This paper contains 10 sections, 4 equations, 4 figures, 1 table.

Figures (4)

  • Figure 2: Accuracy on the three modified groups of questions. The drop relative to the original MMLU-Pro is written on the bars.
  • Figure 3: Shortcut Selection Ratio for True Positive Pairs in MMLU-Pro+
  • Figure 4: Error Analysis: Correct Pair Identification (CPI) in MMLU-Pro+. The numbers on the bars represent the CPI ratio values. A higher CPI ratio indicates better performance in distinguishing correct answer pairs from incorrect ones.
  • Figure 5: True Positive Pair Samples from math and computer science categories with model predictions.