
Collaborative Performance Prediction for Large Language Models

Qiyuan Zhang, Fuyuan Lyu, Xue Liu, Chen Ma

TL;DR

The paper introduces Collaborative Performance Prediction (CPP), a framework that significantly improves prediction accuracy by leveraging the historical performance of various models on downstream tasks together with additional design factors of both models and tasks. CPP also enables a detailed analysis of factor importance, an area previously overlooked.

Abstract

Comprehensively understanding and accurately predicting the performance of large language models across diverse downstream tasks has emerged as a pivotal challenge in NLP research. Pioneering work on scaling laws for downstream tasks demonstrated intrinsic similarities within model families and used those similarities for performance prediction. However, this work tends to overlook similarities between model families and considers only the design factors listed in the original scaling law. To overcome these limitations, we introduce a novel framework, Collaborative Performance Prediction (CPP), which significantly enhances prediction accuracy by leveraging the historical performance of various models on downstream tasks together with additional design factors of both models and tasks. We also collect collaborative data from online platforms, containing both historical performance and additional design factors. With the support of this collaborative data, CPP not only surpasses traditional scaling laws in predicting the performance of scaled LLMs but also facilitates a detailed analysis of factor importance, an area previously overlooked.

Paper Structure (45 sections, 11 equations, 15 figures, 6 tables)


Figures (15)

  • Figure 1: Framework for Collaborative Performance Prediction of Large Language Models. This schematic delineates two principal components: (1) Collaborative Data, which comprises a score matrix recording the performance of various LLMs across downstream tasks, along with external descriptive factors of both models and tasks; (2) Collaborative Prediction Method, which takes model and task IDs as input and leverages the collaborative data to predict scores accurately.
  • Figure 2: Error Distribution of Predictions (Normalized Score and Rank Derived by Score) Based on the HELM Lite Leaderboard Using Matrix Factorization. We evaluate the effectiveness of Matrix Factorization (MF) with two latent-factor settings, 7 and 10, across two training/validation split percentages. Accuracy is the percentage of instances where the predicted rank equals the actual rank. MAE@2 is defined as the percentage of instances where the absolute difference between the predicted and actual ranks is 2.
  • Figure 3: Distribution of Testing Coverage Across Models and Tasks. The left bar chart shows the number of tasks each model has been tested on; the right bar chart shows the number of models tested on each task.
  • Figure 4: Comparative visualization of predictive accuracy across various scoring methods. From left to right: MF, NCF, NCF with Factor Enhancement, and NCF based solely on Factors. Each plot displays the regression between predicted and actual scores, where the solid line represents the regression fit and the shaded area denotes the confidence interval (CI). The diagonal corresponds to perfect prediction, so a regression line closer to the diagonal indicates higher prediction accuracy. These plots demonstrate the enhanced performance in score prediction achieved by integrating factors into the NCF method.
  • Figure 5: Comparison of the predictive performance of collaborative performance prediction (CPP) versus traditional scaling laws (SL) for LLMs: (a) CPP-0, with no prior testing information, and (b) CPP-2, with prior testing on two tasks.
  • ...and 10 more figures
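
The matrix-factorization setup described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`factorize`, `ranks_desc`, `rank_metrics`), the hyperparameters, and the toy score matrix are all assumptions made for illustration, and plain full-batch gradient descent stands in for whatever optimizer the paper uses. The `mae_at_2` value follows the Figure 2 caption literally (rank difference equal to 2); a "within 2" reading would use `<= 2` instead.

```python
import numpy as np

def factorize(scores, mask, k=2, lr=0.05, reg=0.01, epochs=2000, seed=0):
    """Fit scores ~ P @ Q.T on observed cells (mask == True) by gradient descent.

    P holds one latent vector per model, Q one per task. All hyperparameters
    are illustrative defaults, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n_models, n_tasks = scores.shape
    P = 0.1 * rng.standard_normal((n_models, k))
    Q = 0.1 * rng.standard_normal((n_tasks, k))
    for _ in range(epochs):
        # Residual on observed cells only; unobserved cells contribute nothing.
        err = np.where(mask, scores - P @ Q.T, 0.0)
        P_step = err @ Q - reg * P
        Q_step = err.T @ P - reg * Q
        P += lr * P_step
        Q += lr * Q_step
    return P, Q

def ranks_desc(values):
    """Rank entries along axis 0, 1 = highest score (ties broken arbitrarily)."""
    return (-values).argsort(axis=0).argsort(axis=0) + 1

def rank_metrics(pred, actual):
    """Share of entries whose rank matches exactly, and whose rank is off by 2."""
    diff = np.abs(ranks_desc(pred) - ranks_desc(actual))
    return {
        "accuracy": float(np.mean(diff == 0)),
        "mae_at_2": float(np.mean(diff == 2)),  # literal reading of the caption
    }

# Made-up toy score matrix: 4 hypothetical models x 5 hypothetical tasks.
ability = np.array([0.9, 0.7, 0.5, 0.3])
easiness = np.array([1.0, 0.8, 0.6, 0.9, 0.5])
scores = np.outer(ability, easiness)

# Hide two cells to mimic models that were never tested on some tasks.
mask = np.ones_like(scores, dtype=bool)
mask[0, 4] = False
mask[2, 1] = False

P, Q = factorize(scores, mask)
pred = P @ Q.T  # predictions for every (model, task) pair, including hidden ones
train_mse = float(np.mean((scores - pred)[mask] ** 2))
metrics = rank_metrics(pred, scores)  # ranks are computed per task (column)
```

In this sketch the hidden cells play the role of untested model-task pairs: `pred[0, 4]` and `pred[2, 1]` are the collaborative predictions for them, and `rank_metrics` compares the per-task rankings derived from predicted and actual scores.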

Theorems & Definitions (1)

  • Definition 1