Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks

Dongjun Kim, Gyuho Shim, Yongchan Chun, Minhyuk Kim, Chanjun Park, Heuiseok Lim

TL;DR

Benchmark Profiling addresses the misalignment between automated benchmark scores and real-world competence by diagnosing the underlying abilities benchmarks require. It defines ten cognitively informed abilities, builds a dedicated diagnostic dataset for each, and combines gradient-based importance scores with targeted MLP ablations to compute the Ability Impact Score $\mathrm{AIS}^{a}_{b}$, producing a Benchmark Profile that maps a benchmark's dependency on each ability. Across ten benchmarks and three open models, the framework shows that tasks rely on multiple abilities and that ability mixtures differ even among similarly labeled benchmarks. Code benchmarks demand broad, multi-skill competence, and abilities irrelevant to a task can cause negative transfer. This mechanistic, transparent auditing approach provides a practical tool for benchmark design, model interpretation, and more human-aligned evaluation of LLM capabilities.

Abstract

Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability because they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify whether these benchmarks actually measure what their labels claim. We introduce Benchmark Profiling, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model's success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task can negatively affect performance. Benchmark Profiling therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark auditing and model interpretability.
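
The ablate-and-compare idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The importance measure (|weight × gradient|), the flat parameter list, and the form of the score (relative accuracy drop) are all assumptions made for illustration, since the exact definitions are not given in this section. All function names are hypothetical.

```python
def top_k_mask(importance, k_percent):
    """Mark the top k percent of entries (by importance) for ablation."""
    n = len(importance)
    n_ablate = int(n * k_percent / 100)
    # Indices ordered from most to least important.
    order = sorted(range(n), key=lambda i: importance[i], reverse=True)
    selected = set(order[:n_ablate])
    return [i in selected for i in range(n)]


def ablate(weights, grads, k_percent):
    """Zero out the weights that matter most for one ability.

    ASSUMPTION: importance is the first-order saliency |w * g|, where g is
    the gradient of the loss on that ability's diagnostic dataset.
    """
    importance = [abs(w * g) for w, g in zip(weights, grads)]
    mask = top_k_mask(importance, k_percent)
    return [0.0 if hit else w for w, hit in zip(weights, mask)]


def ability_impact_score(acc_base, acc_ablated):
    """ASSUMED form of the score: relative accuracy drop after ablation.

    Positive: the benchmark depends on the ablated ability.
    Negative: removing the ability helped, i.e. it was hurting performance.
    """
    if acc_base == 0:
        raise ValueError("base accuracy must be non-zero")
    return (acc_base - acc_ablated) / acc_base


# Toy example with four "parameters" and made-up gradients.
weights = [0.5, -2.0, 1.0, 0.1]
grads = [0.1, 0.3, -0.05, 2.0]
ablated = ablate(weights, grads, k_percent=50)
print(ablated)
print(ability_impact_score(0.80, 0.60))
```

Under these assumptions, a benchmark whose accuracy is unchanged by an ablation scores zero for that ability, and a benchmark whose accuracy rises after ablation scores negative, which matches the negative-transfer finding (iv).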

Paper Structure

This paper contains 45 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Top ability-benchmark links for Llama-3.1-8B-Instruct derived from its Benchmark Profile (ribbons shown only where $\mathrm{AIS}>0.05$; ribbon width $\propto$ impact).
  • Figure 2: Three-step pipeline of Benchmark Profiling. Left: We define ten cognitively motivated abilities and create a dedicated diagnostic dataset for each one. Middle: Using the diagnostic dataset, we rank the base model's parameters by gradient-based importance, and zero out (orange) the top $k$ percent associated with that ability. Right: We run the original and ability-ablated models on downstream benchmarks. Their task accuracies yield the Ability Impact Score (AIS), which quantifies how strongly the benchmark depends on the ablated ability.
  • Figure 3: Ability Impact Score radar plots for the ten benchmarks profiled on Llama-3.1-8B-Instruct. Axes are labeled with abbreviated ability names. Blue and red shading indicate positive and negative AIS values, respectively.
  • Figure 4: Jensen-Shannon similarity after min-max normalization. Each bar compares two models on a single benchmark.
  • Figure 5: Interface shown to volunteer experts during item validation. Progress is indicated by a bar at the top. Annotators read the prompt, inspect the ten ability options, and enter a numeric choice.
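
The comparison named in the Figure 4 caption can be sketched as follows. This is a rough illustration rather than the paper's procedure. It assumes that each model's profile on a benchmark is a vector of ten AIS values, that min-max normalization is applied per profile, that the normalized vector is rescaled to sum to one, and that similarity is defined as one minus the base-2 Jensen-Shannon divergence. None of these details are stated in this section, and the function names are hypothetical.

```python
import math


def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant vector maps to all ones."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def to_distribution(values):
    """Rescale non-negative values so they sum to one (ASSUMED step)."""
    total = sum(values)
    return [v / total for v in values]


def kl_divergence(p, q):
    """Base-2 KL divergence; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_similarity(profile_a, profile_b):
    """ASSUMED definition: 1 - Jensen-Shannon divergence (base 2)."""
    p = to_distribution(min_max_normalize(profile_a))
    q = to_distribution(min_max_normalize(profile_b))
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return 1.0 - jsd


# Made-up AIS profiles for two models on one benchmark.
model_a = [0.30, 0.05, -0.02, 0.10]
model_b = [0.25, 0.07, 0.00, 0.12]
print(js_similarity(model_a, model_b))
```

With base-2 logarithms the divergence lies in [0, 1], so identical profiles give a similarity of 1 and profiles with no overlapping mass give 0.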