Plug-and-Play Performance Estimation for LLM Services without Relying on Labeled Data

Can Wang, Dianbo Sui, Hongliang Sun, Hao Ding, Bolin Zhang, Zhiying Tu

TL;DR

This paper introduces a novel method to estimate the performance of LLM services across different tasks and contexts. The method is "plug-and-play": it utilizes only a few unlabeled samples, like ICL, and uses four distinct meta-models to produce the performance estimate.

Abstract

Large Language Model (LLM) services exhibit impressive capability on unlearned tasks, leveraging only a few examples through in-context learning (ICL). However, the success of ICL varies with the task and context, leading to heterogeneous service quality. Directly estimating the performance of LLM services at each invocation can be laborious, especially when it requires abundant labeled data or internal information from the LLM. This paper introduces a novel method to estimate the performance of LLM services across different tasks and contexts, which can be "plug-and-play", utilizing only a few unlabeled samples like ICL. Our findings suggest that the negative log-likelihood and perplexity derived from LLM service invocation can function as effective and significant features. Based on these features, we utilize four distinct meta-models to estimate the performance of LLM services. Our proposed method is compared against unlabeled estimation baselines across multiple LLM services and tasks. We also apply it experimentally to two scenarios, demonstrating its effectiveness in the selection and further optimization of LLM services.
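
To make the idea of invocation-derived features concrete, the following is a minimal sketch, not the authors' implementation. It assumes the LLM service returns per-token log-probabilities for each invocation, computes the mean negative log-likelihood and the perplexity from them, and fits a one-feature least-squares line as a stand-in meta-model. The abstract does not name the four meta-models, so `fit_line`, the function names, and the sample numbers are all hypothetical.

```python
import math
from statistics import mean


def nll_and_perplexity(token_logprobs):
    """Return (mean negative log-likelihood, perplexity) for one invocation.

    token_logprobs: natural-log probabilities of the generated tokens,
    assumed to be exposed by the LLM service.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    nll = -mean(token_logprobs)
    # Perplexity is the exponential of the mean negative log-likelihood.
    return nll, math.exp(nll)


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (a stand-in meta-model)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature values must not all be identical")
    a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return a, y_bar - a * x_bar


# Hypothetical history: mean NLL per task and the accuracy measured on it.
past_nll = [0.2, 0.5, 0.9, 1.4]
past_accuracy = [0.92, 0.81, 0.66, 0.48]
slope, intercept = fit_line(past_nll, past_accuracy)

# Estimate performance for a new, unlabeled invocation from its log-probs.
new_nll, new_ppl = nll_and_perplexity([-0.3, -0.7, -0.4])
estimated_accuracy = slope * new_nll + intercept
```

Only the unlabeled invocation's log-probabilities are needed at estimation time; labels appear solely in the hypothetical history used to fit the stand-in meta-model.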

Paper Structure

This paper contains 19 sections, 11 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Distribution of the four features and the LLM service performance, as well as the fitting curve (from two randomly selected task invocation results).
  • Figure 2: Pearson correlation coefficient of features and LLM services performance.
  • Figure 3: Procedure of our meta-model based LLM service performance estimation.
  • Figure 4: Experimental results (MAE) for different tasks of the LLM services performance estimation (our method) and baselines.
  • Figure 5: Execution performance under the settings of our method and randomly selected services or contexts.
  • ...and 1 more figure