Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance

Bryan Etzine, Masoud Hashemi, Nishanth Madhusudhan, Sagar Davasam, Roshnee Sharma, Sathwik Tejaswi Madhusudhan, Vikas Yadav

TL;DR

The paper tackles saturation in standard NLP benchmarks by introducing EMDM, a weighted evaluation that fuses final-answer accuracy with CoT reasoning quality. It leverages guided and unguided prompting from a baseline LLM to form a 16-category transition framework and learns category weights via a constrained optimization to maximize inter-LLM separation. The resulting metric, defined as a weighted average over samples, yields substantially greater model differentiation than exact-match alone, demonstrated across MMLU, ARC-Challenge, TruthfulQA, and GSM8K. This approach enables more nuanced evaluation of reasoning depth and knowledge usage, with practical implications for benchmarking robustness and progress tracking in large language models. The framework includes ablations, sensitivity analyses of weight bounds, and guidelines for baseline selection, while acknowledging limitations in CoT-judgment accuracy and ethical considerations in data use.

Abstract

Existing benchmarks are becoming saturated and struggle to separate model performances due to factors like data contamination and advancing LLM capabilities. This paper introduces EMDM (Enhanced Model Differentiation Metric), a novel weighted metric that revitalizes benchmarks by enhancing model separation. EMDM integrates final answer and Chain-of-Thought (CoT) reasoning correctness, assigning weights based on the complexity and reasoning depth required to solve a given sample in the evaluation data. Using a baseline LLM in two setups (Unguided, where the model has no prior exposure to test samples, and Guided, where the model has prior knowledge of the desired answer), EMDM distinguishes instances of varying difficulty. The CoT and answer correctness from these setups inform an optimization objective for weight assignment, resulting in a more nuanced evaluation of model performance. Compared to the exact match (EM) metric, which achieves 17% separation on ARC-Challenge, EMDM achieves 46%, demonstrating its effectiveness in differentiating models based on reasoning and knowledge requirements.
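
To make the categorization and weighting described in the TL;DR and abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes each baseline response is summarized by two booleans (answer correct, CoT correct), so the Unguided and Guided setups together give 4 x 4 = 16 categories. It also assumes the metric is a weighted average of per-sample correctness. The function names `category` and `weighted_score` and the weight values are hypothetical; the paper learns the weights through a constrained optimization that is not reproduced here.

```python
from itertools import product

# Each baseline response is summarized as (answer_correct, cot_correct).
# Four possible states per setup: (T, T), (T, F), (F, T), (F, F).
STATES = list(product([True, False], repeat=2))


def category(unguided, guided):
    """Map an (unguided, guided) pair of states to a category id in 0..15."""
    return STATES.index(unguided) * len(STATES) + STATES.index(guided)


def weighted_score(correct, categories, weights):
    """Weighted average of per-sample correctness.

    correct    -- per-sample booleans for the model being evaluated
    categories -- per-sample category ids derived from the baseline LLM
    weights    -- one weight per category (illustrative values here)
    """
    total = sum(weights[c] for c in categories)
    if total == 0:
        raise ValueError("weights of the evaluated samples sum to zero")
    hits = sum(weights[c] for ok, c in zip(correct, categories) if ok)
    return hits / total


# Two samples: one the baseline solves in both setups (category 0),
# one it fails in both setups (category 15).
cats = [
    category((True, True), (True, True)),
    category((False, False), (False, False)),
]
uniform = [1.0] * 16    # uniform weights reduce to plain accuracy
upweighted = [1.0] * 16
upweighted[15] = 3.0    # assumed: harder category gets more weight

model_correct = [True, False]
print(weighted_score(model_correct, cats, uniform))
print(weighted_score(model_correct, cats, upweighted))
```

With uniform weights the score equals ordinary exact-match accuracy. Raising the weight of a category that the baseline fails makes a miss on those samples cost more, which is the kind of mechanism that can widen the gap between models.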

Paper Structure

This paper contains 20 sections, 6 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: "Enhanced Model Differentiation Metric (EMDM)": for each benchmark, a baseline LLM is used to 1) generate guided and unguided responses, 2) create data categories based on the correctness of the answer and of the CoT facts and reasoning (the transition matrix), and 3) assign weights to each of the categories ($w_{g_k}$, see section \ref{sec:methods}) and calculate the weighted average.
  • Figure 2: ARC-Challenge sample distribution with Mistral 7B on Unguided and Guided prompt responses.
  • Figure 3: The average exact match (EM) accuracy in different sample groups of ARC-Challenge, with Mistral 7B as the baseline. Groups with 0 or 1 sample are not shown.
  • Figure 4: Kendall's Tau correlation between (Left) GPT-3.5 and Mistral 7B-Instruct and (Right) Qwen2-1.5B and GPT-3.5. The numbers on the top and right show the marginal count of samples in each category. Cells with fewer than 10 samples are removed (so the margins may differ because of the removed cells). Each cell pairs a Guided outcome (Answer Correct/Incorrect, CoT Correct/Incorrect) with an Unguided outcome (Answer Correct/Incorrect, CoT Correct/Incorrect).
  • Figure 5: Effect of $U$, the upper bound in the weight optimization (Eqn. \ref{eqn:pairwise_weight_optimization}), on model separation.
  • ...and 3 more figures