BenTo: Benchmark Task Reduction with In-Context Transferability

Hongyu Zhao, Ming Li, Lichao Sun, Tianyi Zhou

TL;DR

This paper investigates how to efficiently reduce the tasks used to benchmark LLMs without affecting the evaluation quality and proposes a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL).

Abstract

Evaluating large language models (LLMs) is costly: it requires the generation and examination of LLM outputs on a large-scale benchmark of various tasks. This paper investigates how to efficiently reduce the tasks used to benchmark LLMs without affecting the evaluation quality. Our study reveals that task transferability and relevance provide critical information to identify the most representative subset of tasks via optimizing a facility location function. We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL). By analyzing the pairwise transferability, we can reduce the tasks in a modern LLM benchmark (e.g., MMLU or FLAN) to 5% of the original set while inducing a difference of less than 4% from the evaluation on the original benchmark. Compared to prior work, our method is training-free, gradient-free, and highly efficient, requiring only ICL.
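As a rough illustration of the facility location selection mentioned in the abstract, the sketch below greedily picks k representative tasks by maximizing $f(A)=\sum_i \max_{j\in A} S_{ij}$ over a pairwise similarity matrix. The function names, the greedy solver, and the toy matrix `sim` are assumptions made for illustration; the paper derives its similarities from in-context transferability, and this is not the authors' implementation.

```python
def facility_location_value(sim, selected):
    """f(A) = sum over tasks i of the best similarity to any selected task j."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_facility_location(sim, k):
    """Greedily select k task indices; assumes nonnegative similarities."""
    n = len(sim)
    selected = []
    cover = [0.0] * n  # best similarity of each task to the current selection
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding task j to the selection.
            gain = sum(max(sim[i][j] - cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        cover = [max(cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Hypothetical toy similarities: tasks 0-1 form one cluster, tasks 2-4 another.
sim = [
    [1.0, 0.8, 0.1, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.7, 0.5],
    [0.1, 0.1, 0.7, 1.0, 0.7],
    [0.1, 0.1, 0.5, 0.7, 1.0],
]
chosen = greedy_facility_location(sim, k=2)
print(chosen, facility_location_value(sim, chosen))
```

On this toy matrix the greedy pass selects one task from each cluster, which is the behavior one would hope for from a representative subset. The greedy rule is a common choice here because the facility location function is monotone submodular.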

Paper Structure

This paper contains 16 sections, 7 equations, 4 figures, 11 tables, 1 algorithm.

Figures (4)

  • Figure 1: LEFT: In-context Transferability (ICT) reveals the clusters of benchmark tasks. We apply spectral clustering to ICT (arcs) between MMLU tasks (nodes); each node's color denotes the cluster it belongs to. The discovered clusters are associated with explainable themes. The theme and tasks of each cluster are listed around the chord graph. Only the top 7% of arcs with the highest ICT values are shown in the graph; among these, intra-cluster arcs far outnumber inter-cluster arcs, implying a "sparse" topology captured by ICT. RIGHT: Evaluation accuracy of task reduction methods. Each method selects 3 out of the 57 tasks in MMLU to evaluate 9 LLMs (axes). The plot reports $1-|\sigma-\sigma^*|/\sigma^*$ in log scale, where $\sigma$ and $\sigma^*$ are the evaluation metrics on the reduced benchmark and the full benchmark, respectively. Our method (BenTo-le) achieves 97% evaluation accuracy on average. The grey band reports the random selection baseline's mean ± standard deviation. All baselines are defined in the experiments section. The MMLU results table reports the results when selecting different numbers of tasks.
  • Figure 2: Similarity matrices.
  • Figure 3: Difference ($\Delta$) in NRMSE between $S$ ("sim") and $S'$ ("le") when used to select different numbers of tasks (x-axis). Larger $\Delta$ indicates the "le" variant produces a better reduced benchmark than "sim". For both BenTo and BM25, "le" is better ($\Delta\geq 0$) for smaller $k$ while "sim" is better ($\Delta\leq 0$) for larger $k$.
  • Figure 4: Ablation study on facility location (FL) vs. K-medoids: we report the best NRMSE (lower is better) achieved by each method on MMLU. KM denotes K-medoids. KM-raw, KM-sim and KM-le denote K-medoids on the raw feature matrix $A$, similarity matrix $S$ and $S'$ respectively.
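The evaluation accuracy plotted in Figure 1, $1-|\sigma-\sigma^*|/\sigma^*$, can be computed per model as in the minimal sketch below. The function name and the scores are made-up placeholders chosen for illustration, not results from the paper.

```python
def evaluation_accuracy(reduced_score, full_score):
    """Return 1 - |sigma - sigma*| / sigma* for a single model."""
    if full_score <= 0:
        raise ValueError("full-benchmark score must be positive")
    return 1.0 - abs(reduced_score - full_score) / full_score


# Hypothetical per-model scores (sigma* on the full benchmark, sigma on the reduced one).
full_scores = [0.60, 0.45, 0.70]
reduced_scores = [0.63, 0.45, 0.665]

accuracies = [
    evaluation_accuracy(r, f) for r, f in zip(reduced_scores, full_scores)
]
mean_accuracy = sum(accuracies) / len(accuracies)
print(accuracies, mean_accuracy)
```

A value of 1 means the reduced benchmark reproduces the full-benchmark score exactly; the relative form makes models with different absolute scores comparable.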