Efficient Evaluation of Large Language Models via Collaborative Filtering

Xu-Xiang Zhong, Chao Yi, Han-Jia Ye

TL;DR

The paper tackles the high cost of evaluating large language models on large benchmarks by proposing a two-stage collaborative-filtering framework that estimates a target model's task-level performance from a small subset of instances. Stage 1 selects informative test instances by computing an importance score that captures how well each instance discriminates between models, personalized to the target model via similar models. Stage 2 predicts performance on the unselected instances, using cross-task information and optimal transport to synthesize data, and yields the estimates $\hat{p}_i$ and $\hat{r}_i$. On benchmarks such as the Open LLM Leaderboard and MMLU, the method achieves lower MAE and weighted MAE than baselines while reducing inference cost and remaining adaptable to new tasks. This enables cost-efficient, task-specific benchmarking in large model zoos and offers a practical route to continuous, personalized evaluation of LLM capabilities.

Abstract

With the development of Large Language Models (LLMs), numerous benchmarks have been proposed to measure and compare the capabilities of different LLMs. However, evaluating LLMs is costly because benchmarks contain a large number of test instances and LLM inference is slow. In this paper, we explore how to efficiently estimate a model's real performance on a given benchmark from its evaluation results on a small number of instances sampled from that benchmark. Inspired by Collaborative Filtering (CF) in Recommendation Systems (RS), we treat LLMs as users and test instances as items, and propose a two-stage method. In the first stage, we treat instance selection as recommending products to users, choosing instances that can easily distinguish model performance. In the second stage, we treat performance prediction as a rating prediction problem in RS and predict the target LLM's behavior on the unselected instances. Experiments on multiple LLMs and datasets show that our method can accurately estimate the target model's performance while greatly reducing its inference overhead.
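The users-as-LLMs, items-as-instances framing can be illustrated with a small sketch. This is not the authors' algorithm: it swaps in variance-based selection and a similarity-weighted nearest-neighbor average as simplified stand-ins for the paper's iterative importance scoring and optimal-transport-based prediction. The function names, the binary correctness matrix `R`, and the `n_neighbors` parameter are all assumptions made for illustration.

```python
import numpy as np

def select_instances(R, k):
    """Stage 1 stand-in: pick the k instances that best separate the reference models.

    R is an (n_models, n_instances) binary matrix; R[m, j] = 1 means reference
    model m answered instance j correctly. (Hypothetical layout.)
    """
    variance = R.var(axis=0)  # high variance = reference models disagree on it
    # Stable sort on the negated variance keeps lower indices first on ties.
    return np.argsort(-variance, kind="stable")[:k]

def estimate_accuracy(R, selected, target_obs, n_neighbors=2):
    """Stage 2 stand-in: estimate the target model's accuracy on the full benchmark.

    target_obs holds the target model's 0/1 results on the `selected` instances.
    """
    # Similarity = agreement rate with each reference model on the selected instances.
    sim = (R[:, selected] == target_obs).mean(axis=1)
    neighbors = np.argsort(-sim, kind="stable")[:n_neighbors]
    weights = sim[neighbors] + 1e-8  # guard against all-zero similarities
    # Predicted correctness per instance: weighted average over the neighbors.
    pred = weights @ R[neighbors] / weights.sum()
    # Where the target model was actually evaluated, use the observed result.
    pred[selected] = target_obs
    return pred.mean()
```

A call such as `estimate_accuracy(R, select_instances(R, k), target_obs)` then needs the target model to run on only `k` instances; the remaining entries of its row are filled in from the reference models, as in rating prediction.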

Paper Structure

This paper contains 27 sections, 14 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Comparison between Methods and Problem Setting. On the left, the red dashed line represents the real performance of a new model, and the gray area indicates the gap between the estimated performance of our method and the real performance, which is smaller. On the right is the problem setup, where the goal is to extract a subset from each task, use the new model's evaluation on it to predict performance on each task and minimize the gap between estimated and real performance.
  • Figure 2: The Paradigms of Original and Efficient LLM Benchmark. The left part illustrates the evaluation process of the Original LLM Benchmark. The right part shows the process of an efficient evaluation method, which consists of two main components: the Instance Selection Function $g$ and the Performance Estimation Function $h$. The goal of efficient evaluation methods is to design effective $g$ and $h$ to minimize the difference between real performance $\bm{p}$ and predicted performance $\hat{\bm{p}}$.
  • Figure 3: Steps in Instance Selection Process. We select instances that can easily distinguish models through an iterative process.
  • Figure 4: Steps in Performance Prediction Process. We predict performance based on optimal transport and collaborative filtering.
  • Figure 5: The Mean Absolute Error (MAE) and weighted MAE between the LLM performance estimated by different efficient evaluation methods and the real performance of the LLMs. The first and second rows show the results on MMLU and LB, respectively.
  • ...and 3 more figures