SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization

Taolin Zhang, Hang Guo, Wang Lu, Tao Dai, Shu-Tao Xia, Jindong Wang

TL;DR

The work tackles the rising cost of evaluating large language models on large benchmarks by exploiting sparsity in the model-item score matrix. SparseEval selects a small anchor set and learns their weights via gradient descent, with an iterative anchor refinement procedure guided by Anchor Importance Score and Candidate Importance Score. The method uses spectral clustering to reveal anchor candidates and achieves accurate performance estimates with as few as 100 items, while maintaining strong ranking alignment (Kendall's tau) and low estimation error across ARC, GSM8K, HellaSwag, MMLU, TruthfulQA, and Winogrande. The results imply substantial practical savings and broad applicability to efficient benchmarking of future LLMs; code is available.

Abstract

As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of an MLP to handle sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to evaluate the value of each item for task-aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's $\tau$ of our method across a variety of benchmarks, showcasing its superior robustness and practicality in real-world scenarios. Code is available at https://github.com/taolinzhang/SparseEval.
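
The idea of estimating a full-benchmark average from a few weighted anchor items can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain linear weights instead of the MLP mentioned in the abstract, random anchors with no refinement step, and a synthetic score matrix. All names (`fit_anchor_weights`, `mse`, the ability/difficulty generator) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic model-item score matrix (assumption: binary correctness drawn
# from a simple ability-vs-difficulty logistic model, not real benchmark data).
m, n, k = 60, 200, 20
ability = rng.normal(size=(m, 1))
difficulty = rng.normal(size=(1, n))
S = (rng.random((m, n)) < 1.0 / (1.0 + np.exp(difficulty - ability))).astype(float)

train, test = S[:40], S[40:]                      # split over models
anchors = rng.choice(n, size=k, replace=False)    # random anchors, no refinement


def mse(w, scores):
    """Mean squared error between the anchor-weighted estimate and the true average."""
    return float(np.mean((scores[:, anchors] @ w - scores.mean(axis=1)) ** 2))


def fit_anchor_weights(scores, steps=2000):
    """Plain gradient descent on the training MSE, starting from uniform weights."""
    X, y = scores[:, anchors], scores.mean(axis=1)
    w = np.full(k, 1.0 / k)
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    lr = len(y) / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(steps):
        w -= lr * (2.0 / len(y)) * X.T @ (X @ w - y)
    return w


w_uniform = np.full(k, 1.0 / k)
w_fit = fit_anchor_weights(train)
train_mse_uniform, train_mse_fit = mse(w_uniform, train), mse(w_fit, train)
print("train MSE  uniform:", train_mse_uniform, " fitted:", train_mse_fit)
print("held-out MSE fitted:", mse(w_fit, test))
```

The held-out figure is the one that matters in practice: a new model would be run only on the `k` anchor items, and its benchmark average estimated from the learned weights.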

Paper Structure

This paper contains 20 sections, 2 theorems, 22 equations, 11 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

In the linear weight setting, let $S \in \mathbb{R}^{m\times n}$ be the model-sample score matrix, and define the true overall average as the product of $S$ with the uniform weight vector $W_a = \frac{1}{n}\mathbf{1}_n$. For any anchor set $A \subseteq \{1,\dots,n\}$, define the feasible linear weight class and the optimal reconstruction error over that class. Whenever $A\subseteq B$, the optimal reconstruction error for $B$ is no larger than that for $A$.
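
The monotonicity claim can be checked numerically with a small sketch. Since the displayed definitions are not available here, the sketch assumes that the feasible class for an anchor set consists of weight vectors that are zero outside the anchors, and that the reconstruction error is the $L_2$ distance to the uniform average. The function name `optimal_error` and the synthetic data are hypothetical.

```python
import numpy as np


def optimal_error(S, anchor_idx):
    """Smallest L2 error when approximating the uniform average with anchor columns only."""
    target = S.mean(axis=1)                       # S @ W_a with W_a = (1/n) * ones
    X = S[:, anchor_idx]
    w, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.linalg.norm(X @ w - target))


rng = np.random.default_rng(1)
S = rng.random((30, 100))                         # synthetic scores, not benchmark data
B = rng.choice(100, size=15, replace=False)       # larger anchor set
A = B[:5]                                         # nested subset, so A is contained in B

err_A, err_B = optimal_error(S, A), optimal_error(S, B)
print("error with |A| = 5 :", err_A)
print("error with |B| = 15:", err_B)
```

Under these assumptions the result follows directly: any weight vector feasible for $A$ is also feasible for $B$, so minimizing over the larger class cannot give a larger error.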

Figures (11)

  • Figure 1: Evidence of Evaluation Sparsity in LLM Benchmarks. We construct an item-item similarity matrix by computing the cosine similarity between item vectors. The presence of pronounced diagonal blocks, along with high intra- and inter-cluster similarity, suggests the existence of evaluation sparsity and redundancy in the benchmark.
  • Figure 2: Anchor Refinement in SparseEval. We leverage a proxy model to perform task-aware anchor refinement. By iteratively replacing items with low Anchor Importance Scores with those having high Candidate Importance Scores, we are able to obtain more representative anchors for efficient evaluation.
  • Figure 3: Error Trend on ARC. SparseEval consistently outperforms baselines.
  • Figure 4: Ablation over network architecture on GSM8K.
  • Figure 5: Ablation over training data proportion on HellaSwag.
  • ...and 6 more figures
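
The item-item similarity matrix described in the Figure 1 caption can be sketched as follows. This assumes each item vector is the corresponding column of the model-item score matrix (one entry per model); the function name `item_similarity` and the toy matrix are hypothetical, and the cluster ordering used to reveal diagonal blocks is not reproduced.

```python
import numpy as np


def item_similarity(S, eps=1e-12):
    """Cosine similarity between item vectors, i.e. the columns of the m x n score matrix."""
    norms = np.linalg.norm(S, axis=0, keepdims=True)
    unit = S / np.maximum(norms, eps)             # guard against all-zero items
    return unit.T @ unit


# Tiny hand-made example: 2 models, 3 items.
S = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
sim = item_similarity(S)
print(np.round(sim, 3))
```

Items solved by the same set of models get similarity close to 1, which is the kind of redundancy that makes a small anchor set sufficient.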

Theorems & Definitions (4)

  • Proposition 1: More anchors yield no larger reconstruction error
  • Proposition 2: Anchor refinement decreases $L_2$ reconstruction error
  • proof
  • proof