Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance
Bryan Etzine, Masoud Hashemi, Nishanth Madhusudhan, Sagar Davasam, Roshnee Sharma, Sathwik Tejaswi Madhusudhan, Vikas Yadav
TL;DR
The paper tackles saturation in standard NLP benchmarks by introducing EMDM, a weighted evaluation metric that combines final-answer accuracy with chain-of-thought (CoT) reasoning quality. It uses guided and unguided prompting of a baseline LLM to form a 16-category transition framework and learns category weights via a constrained optimization that maximizes inter-LLM separation. The resulting metric, defined as a weighted average over samples, yields substantially greater model differentiation than exact match alone, demonstrated across MMLU, ARC-Challenge, TruthfulQA, and GSM8K. This approach enables more nuanced evaluation of reasoning depth and knowledge usage, with practical implications for benchmarking robustness and progress tracking in large language models. The framework includes ablations, sensitivity analyses of weight bounds, and guidelines for baseline selection, while acknowledging limitations in CoT-judgment accuracy and ethical considerations in data use.
Abstract
Existing benchmarks are becoming saturated and struggle to separate model performances due to factors like data contamination and advancing LLM capabilities. This paper introduces EMDM (Enhanced Model Differentiation Metric), a novel weighted metric that revitalizes benchmarks by enhancing model separation. EMDM integrates final answer and Chain-of-Thought (CoT) reasoning correctness, assigning weights based on the complexity and reasoning depth required to solve a given sample in the evaluation data. Using a baseline LLM in two setups (Unguided, where the model has no prior exposure to test samples, and Guided, where the model has prior knowledge of the desired answer), EMDM distinguishes instances of varying difficulty. The CoT and answer correctness from these setups inform an optimization objective for weight assignment, resulting in a more nuanced evaluation of model performance. Compared to the exact match (EM) metric, which achieves 17% separation on ARC-Challenge, EMDM achieves 46%, demonstrating its effectiveness in differentiating models based on reasoning and knowledge requirements.
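To make the idea of a category-weighted average concrete, the following is a minimal sketch, not the paper's implementation. It assumes that the 16 categories arise from four binary baseline outcomes per sample (answer and CoT correctness in the Unguided and Guided setups), and that the score is the weight-normalized sum of per-sample correctness. The names `weighted_score` and `exact_match`, the toy data, and the hand-set weights (one plus the number of baseline failures) are all hypothetical; the paper learns its weights through a constrained optimization instead.

```python
from itertools import product

# Assumed encoding: (unguided answer, unguided CoT, guided answer, guided CoT),
# each True if the baseline LLM got it right. 2**4 = 16 categories.
CATEGORIES = list(product([False, True], repeat=4))

# Illustrative hand-set weights (NOT learned): samples the baseline fails
# more often count more. The paper optimizes these weights instead.
weights = {c: 1 + c.count(False) for c in CATEGORIES}


def exact_match(correct):
    """Plain accuracy: fraction of samples answered correctly."""
    return sum(correct) / len(correct)


def weighted_score(correct, categories, weights):
    """Weight-normalized average of per-sample correctness."""
    total = sum(weights[c] for c in categories)
    if total == 0:
        raise ValueError("weights must sum to a positive value")
    gained = sum(weights[c] for ok, c in zip(correct, categories) if ok)
    return gained / total


# Toy data: two "easy" samples (baseline always right) and two "hard" ones.
easy = (True, True, True, True)
hard = (False, False, False, False)
cats = [easy, easy, hard, hard]

model_a = [True, True, True, False]   # misses one hard sample
model_b = [True, False, True, True]   # misses one easy sample

print(exact_match(model_a), exact_match(model_b))
print(weighted_score(model_a, cats, weights),
      weighted_score(model_b, cats, weights))
```

In this toy setting both models have the same exact-match accuracy, but the weighted score separates them because the second model solves both of the samples the baseline found hard.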
