Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric

Yixin Cao, Jiahao Ying, Yaoning Wang, Xipeng Qiu, Xuanjing Huang, Yugang Jiang

TL;DR

This work addresses the evaluation generalization gap for large language models by introducing the Model Utilization Index (MUI), a mechanism interpretability metric that quantifies the fraction of a model's capabilities activated during inference and complements traditional performance metrics. It formalizes MUI through neuron-based and sparse autoencoder (SAE) based interpretations, and demonstrates a near-logarithmic inverse relationship between MUI and performance, termed the Utility Law, across diverse datasets and models. Four corollaries derived from this law guide training diagnostics, data contamination detection, fair model comparison, and data diversity evaluation, with empirical validation on multiple benchmarks and open-source LLMs. The framework supports better interpretation of model capabilities, gives practical guidance for training and data curation, and enables cross-model rankings that are more robust than raw accuracy alone. The authors also provide code to reproduce their analyses.

Abstract

Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications, yet current evaluation methods struggle to keep pace with their rapid development. One core challenge of evaluation in the LLM era is the generalization issue: how to infer a model's near-unbounded abilities from inevitably bounded benchmarks. We address this challenge by proposing the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores. MUI quantifies the effort a model expends on a task, defined as the proportion of neurons or features activated during inference. Intuitively, a truly capable model should achieve higher performance with lower effort. Extensive experiments across popular LLMs reveal a consistent inverse logarithmic relationship between MUI and performance, which we formulate as the Utility Law. From this law we derive four practical corollaries that (i) guide training diagnostics, (ii) expose data contamination, (iii) enable fairer model comparisons, and (iv) evaluate dataset diversity in a model-specific way. Our code can be found at https://github.com/ALEX-nlp/MUI-Eva.
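
The definition of MUI as a proportion of activated neurons can be illustrated with a minimal sketch. The details here are assumptions for illustration only, not the authors' implementation: a neuron counts as "activated" when its value exceeds a hypothetical `threshold`, and the activated sets are pooled (unioned) over all samples of a task before dividing by the total neuron count. The function names and the toy numbers are invented.

```python
from typing import Sequence, Set


def activated_neurons(activations: Sequence[float], threshold: float) -> Set[int]:
    """Indices of neurons whose activation exceeds the (assumed) threshold."""
    return {i for i, value in enumerate(activations) if value > threshold}


def model_utilization_index(
    per_sample_activations: Sequence[Sequence[float]],
    threshold: float = 0.0,
) -> float:
    """Fraction of all neurons activated by at least one sample of the task."""
    if not per_sample_activations:
        raise ValueError("need at least one sample")
    total = len(per_sample_activations[0])
    used: Set[int] = set()
    for activations in per_sample_activations:
        if len(activations) != total:
            raise ValueError("every sample must cover the same neurons")
        # Pool activated neurons across samples (assumption: union over the task).
        used |= activated_neurons(activations, threshold)
    return len(used) / total


# Toy example: a "model" with 8 neurons evaluated on 3 samples (invented values).
toy_activations = [
    [0.9, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0],  # activates neurons 0 and 3
    [0.0, 0.0, 0.5, 0.2, 0.0, 0.0, 0.0, 0.0],  # activates neurons 2 and 3
    [0.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # activates neuron 0
]
print(model_utilization_index(toy_activations, threshold=0.1))  # 3 of 8 neurons -> 0.375
```

Under these assumptions, a model that solves the same task while touching fewer distinct neurons receives a lower MUI, which matches the intuition of "higher performance with lower effort".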

Paper Structure

This paper contains 35 sections, 4 theorems, 20 equations, 27 figures, 17 tables.

Key Result

Corollary 1

Training diagnostics: During model training, an increase in model utilization on some dataset may indicate a decline in other capabilities beyond the dataset.
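
One way to act on Corollary 1 is to track MUI on a fixed dataset across training checkpoints and flag the checkpoints where it rises. The sketch below is an assumption-laden illustration: the function name `flag_mui_increases`, the `tolerance` parameter, and the checkpoint values are all hypothetical, and a flag is only a prompt to re-check other capabilities, not proof of regression.

```python
from typing import List, Sequence, Tuple


def flag_mui_increases(
    mui_by_checkpoint: Sequence[Tuple[str, float]],
    tolerance: float = 0.0,
) -> List[str]:
    """Return checkpoints whose MUI rose by more than `tolerance` over the previous one."""
    flagged: List[str] = []
    for (_, previous), (name, current) in zip(mui_by_checkpoint, mui_by_checkpoint[1:]):
        if current - previous > tolerance:
            flagged.append(name)
    return flagged


# Invented MUI values (as fractions) measured on the same dataset at three checkpoints.
history = [("step-1000", 0.12), ("step-2000", 0.11), ("step-3000", 0.15)]
print(flag_mui_increases(history, tolerance=0.01))  # ['step-3000']
```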

Figures (27)

  • Figure 1: Illustration of model utility: how much effort (i.e., activated abilities) is utilized to complete given tasks. Two example cases for generalizable evaluation: 1) A more capable model should achieve higher performance with less effort (lower MUI). 2) When data contamination occurs, higher performance is achieved with more effort (higher MUI), while the model's overall capabilities remain unchanged or even weaken.
  • Figure 2: Relationship between performance (accuracy) and neuron-based MUI. The dashed line is the trend line fitted with a logarithmic function. Due to space limitations, the MBPP results are given in the appendix on the Utility Law (the relationship between MUI and performance).
  • Figure 3: Overall MUI-performance relationship across six datasets. MMLU is excluded because of the inference cost of the DeepSeek series models. According to the fitted model utilization curve, when performance reaches 100%, the minimum MUI is around 9.77%.
  • Figure 4: Optimization directions: evolving, accumulating, coarsening, and collapsing.
  • Figure 5: Relationship between MUI and performance, studied on the Llama and Qwen series. We compare the math and code variants with their base models to see how MUI and performance change under in-domain and out-of-domain testing (e.g., for a code variant, HumanEval is in-domain and GSM8K is out-of-domain). The orange arrow from lower right to upper left denotes the coarsening direction, and the blue arrow from lower left to upper right denotes the accumulating direction.
  • ...and 22 more figures
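
The logarithmic trend line mentioned in the captions of Figures 2 and 3 can be sketched as an ordinary least-squares fit. The functional form is an assumption made for illustration: MUI is modeled as `a + b * ln(performance)` with both quantities in percent, and the data points are synthetic values generated from made-up coefficients (`a = 60`, `b = -10`), not results from the paper. Evaluating the fitted curve at 100% performance mirrors how the Figure 3 caption reads off a minimum MUI, but the number produced here is unrelated to the reported 9.77%.

```python
import math
from typing import Sequence, Tuple


def fit_log_curve(performance: Sequence[float], mui: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of mui = a + b * ln(performance); returns (a, b)."""
    if len(performance) != len(mui) or len(performance) < 2:
        raise ValueError("need at least two paired observations")
    if any(p <= 0 for p in performance):
        raise ValueError("performance values must be positive")
    xs = [math.log(p) for p in performance]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(mui) / len(mui)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("performance values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, mui))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_mui(a: float, b: float, performance: float) -> float:
    """MUI predicted by the fitted curve at a given performance level."""
    return a + b * math.log(performance)


# Synthetic points generated from hypothetical coefficients a=60, b=-10.
perf = [20.0, 40.0, 60.0, 80.0]
mui = [60.0 - 10.0 * math.log(p) for p in perf]

a, b = fit_log_curve(perf, mui)
print(f"a={a:.3f}, b={b:.3f}")  # recovers the hypothetical coefficients
print(f"predicted MUI at 100% performance: {predict_mui(a, b, 100.0):.2f}")
```

A negative `b` corresponds to the inverse relationship described in the abstract: higher performance goes with lower utilization.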

Theorems & Definitions (4)

  • Corollary 1
  • Corollary 2
  • Corollary 3
  • Corollary 4