SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization
Taolin Zhang, Hang Guo, Wang Lu, Tao Dai, Shu-Tao Xia, Jindong Wang
TL;DR
The work tackles the rising cost of evaluating large language models on large benchmarks by exploiting sparsity in the model-item score matrix. SparseEval selects a small anchor set and learns the anchors' weights via gradient descent, with an iterative anchor refinement procedure guided by the Anchor Importance Score and the Candidate Importance Score. The method uses spectral clustering to reveal anchor candidates and achieves accurate performance estimates with as few as 100 items, while maintaining strong ranking alignment (Kendall's tau) and low estimation error across ARC, GSM8K, HellaSwag, MMLU, TruthfulQA, and Winogrande. The results imply substantial practical savings and broad applicability to efficient benchmarking of future LLMs; code is available.
Abstract
As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We use the representation capacity of an MLP to handle the sparse optimization and propose the Anchor Importance Score and the Candidate Importance Score to evaluate the value of each item for task-aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's $\tau$ of our method across a variety of benchmarks, showcasing its robustness and practicality in real-world scenarios. Code is available at https://github.com/taolinzhang/SparseEval.
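To make the anchor-weighting idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary model-item score matrix, picks anchors at random (in place of the paper's iterative refinement), and fits a plain linear predictor by gradient descent (in place of the MLP mentioned above). The names `fit_anchor_weights` and `kendall_tau`, the synthetic data, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def fit_anchor_weights(scores, anchors, lr=0.01, steps=5000):
    """Fit linear weights on anchor items to predict each model's mean score.

    Assumption: a linear predictor stands in for the paper's MLP.
    scores: (n_models, n_items) array of 0/1 correctness.
    """
    X = scores[:, anchors]                 # scores on the anchor items only
    y = scores.mean(axis=1)                # full-benchmark accuracy per model
    w = np.full(len(anchors), 1.0 / len(anchors))  # start from a plain average
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)    # gradient of mean squared error
        w -= lr * grad
    return w

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
    return s / (n * (n - 1) / 2)

# Synthetic stand-in for a model-item matrix (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
ability = rng.uniform(0.2, 0.9, size=(40, 1))        # one latent ability per model
difficulty = rng.uniform(0.0, 0.3, size=(1, 500))    # one difficulty per item
scores = (rng.random((40, 500)) < ability - difficulty + 0.15).astype(float)

# Assumption: random anchors; the paper refines anchors iteratively instead.
anchors = rng.choice(500, size=20, replace=False)
train, test = scores[:30], scores[30:]               # held-out models for evaluation

w = fit_anchor_weights(train, anchors)
pred = test[:, anchors] @ w                          # estimate from 20 items only
true = test.mean(axis=1)                             # accuracy on all 500 items

print("mean absolute error:", np.abs(pred - true).mean())
print("Kendall's tau:", kendall_tau(pred, true))
```

The two printed quantities correspond to the metrics named in the abstract: estimation error on unseen models and rank agreement between estimated and true accuracies.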
