Efficient Evaluation of Large Language Models via Collaborative Filtering
Xu-Xiang Zhong, Chao Yi, Han-Jia Ye
TL;DR
The paper tackles the prohibitive cost of evaluating large language models on large benchmarks by proposing a two-stage collaborative-filtering framework that estimates a target model's task-level performance from a small subset of instances. Stage 1 selects informative test instances using an importance score that captures how well each instance discriminates between models, personalized to the target model via similar models. Stage 2 predicts performance on the unselected instances, using cross-task information and optimal transport to synthesize data, and yields the estimates $\hat{p}_i$ and $\hat{r}_i$. On benchmarks such as the Open LLM Leaderboard and MMLU, the method achieves lower MAE and weighted MAE than baselines while reducing inference cost and remaining adaptable to new tasks. This enables cost-efficient, task-specific benchmarking in large model zoos and offers a practical route to continuous, personalized evaluation of LLM capabilities.
Abstract
With the development of Large Language Models (LLMs), numerous benchmarks have been proposed to measure and compare the capabilities of different LLMs. However, evaluating LLMs is costly because benchmarks contain many test instances and LLM inference is slow. In this paper, we explore how to efficiently estimate a model's real performance on a given benchmark from its evaluation results on a small number of instances sampled from that benchmark. Inspired by Collaborative Filtering (CF) in Recommendation Systems (RS), we treat LLMs as users and test instances as items, and propose a two-stage method. In the first stage, we treat instance selection as recommending products to users, choosing instances that can easily distinguish model performance. In the second stage, we treat performance prediction as a rating prediction problem in RS and predict the target LLM's behavior on unselected instances. Experiments on multiple LLMs and datasets show that our method can accurately estimate the target model's performance while largely reducing its inference overhead.
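
The users-and-items framing can be made concrete with a toy sketch. It is not the paper's algorithm: variance across historical models stands in for the importance score, agreement-weighted neighbor averaging stands in for the rating predictor, and the function names `select_instances` and `estimate_accuracy` are hypothetical. It assumes a binary correctness matrix with one row per previously evaluated model and one column per test instance.

```python
import numpy as np

def select_instances(R_hist, k):
    """Stage 1 stand-in: keep the k instances whose correctness varies
    most across historical models (a simple proxy for 'discriminative')."""
    variance = R_hist.var(axis=0)
    return np.argsort(-variance, kind="stable")[:k]

def estimate_accuracy(R_hist, target_obs, selected, n_neighbors=2):
    """Stage 2 stand-in: predict the target model's correctness on the
    unselected instances from its most similar historical models."""
    # Similarity = fraction of selected instances where outcomes agree.
    sim = (R_hist[:, selected] == target_obs).mean(axis=1)
    nbrs = np.argsort(-sim, kind="stable")[:n_neighbors]
    weights = sim[nbrs] + 1e-8  # avoid division by zero if all sims are 0

    unselected = np.ones(R_hist.shape[1], dtype=bool)
    unselected[selected] = False
    # Weighted average of neighbor outcomes = predicted "rating" per item.
    pred = weights @ R_hist[nbrs][:, unselected] / weights.sum()

    # Combine observed results with predictions into one benchmark score.
    return (target_obs.sum() + pred.sum()) / R_hist.shape[1]

# Toy data (made up): 4 historical models, 6 test instances.
R_hist = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 1],
], dtype=float)
target_true = np.array([1, 1, 0, 0, 1, 0], dtype=float)  # hidden in practice

selected = select_instances(R_hist, k=2)
target_obs = target_true[selected]  # only these instances are actually run
estimate = estimate_accuracy(R_hist, target_obs, selected)
print(f"estimated accuracy: {estimate:.3f}, true: {target_true.mean():.3f}")
```

In this sketch the target model is run on only two of the six instances, and the remaining four outcomes are filled in from the neighbors, which is the source of the inference savings described above.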
