Third-Party Language Model Performance Prediction from Instruction

Rahul Nadkarni, Yizhong Wang, Noah A. Smith

TL;DR

This work proposes a third-party performance prediction framework, in which a separate model is trained to predict the metric value obtained by evaluating an instruction-following system on a task, assuming access only to the system's inputs and outputs at inference time.

Abstract

Language model-based instruction-following systems have lately shown increasing performance on many benchmark tasks, demonstrating the capability of adapting to a broad variety of instructions. However, such systems are often not designed to be transparent about their limitations; a user may easily prompt a model with an instruction without any idea of whether the responses should be expected to be accurate, or whether the system is even capable of performing the task. We propose a third-party performance prediction framework, where a separate model is trained to predict the metric resulting from evaluating an instruction-following system on a task while assuming access only to its inputs and outputs at inference time. We perform this analysis with a variety of both open and closed instruction-following models as well as multiple performance predictors, and examine the effect of various factors such as model size, number of training tasks, and prompt format. Our findings indicate that third-party performance prediction is very challenging, and much work remains in developing predictors that can automatically reveal the limitations of modern instruction-following natural language processing systems.

Paper Structure

This paper contains 20 sections, 3 figures, 8 tables.

Figures (3)

  • Figure 1: A diagram illustrating our complete analysis pipeline. We begin with a pretrained LM that is instruction-tuned using training tasks from chosen instruction data, resulting in an instruction-tuned model (IM). The IM is evaluated using the test tasks of the instruction data (not necessarily from the same dataset as the training tasks) and a choice of evaluation metric. Each pair of test task instruction ($\boldsymbol{x}$) and evaluation performance metric value ($y$) is used to construct the performance data, which itself is split into train, validation, and test sets. The train and validation sets are used to train another ("third party") pretrained LM to predict the performance of the IM as a regression model, resulting in the performance predictor (PP). Finally, the PP is evaluated on the test set of the performance data to determine how well it can predict the performance of the IM on unseen tasks. The sections of the diagram highlighted in blue indicate the components of the pipeline that we vary to determine their effect on performance prediction: the size of the IM, the choice of instruction data, the choice of evaluation metric, and the size and type of PP model.
  • Figure 2: Predicted vs. true metric value when using RoBERTa-large to map from task instruction to performance -- either ROUGE-L (top row) or Exact Match (bottom row) -- for various instruction-following models (columns).
  • Figure 3: Predicted vs. true loss value when using RoBERTa-large to map from task instruction to loss for various instruction-following models.
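
The pipeline described in Figure 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_im`, `build_performance_data`, `split`, `MeanPredictor`, and `rmse` are hypothetical names, the per-task metric is assumed to be the mean Exact Match over a task's examples, the 80/10/10 split and the RMSE evaluation are assumed choices, and the predictor is a trivial mean baseline standing in for a fine-tuned regression model such as RoBERTa-large.

```python
import random
from statistics import mean


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the strings match after simple normalization (an assumption)."""
    return float(prediction.strip().lower() == reference.strip().lower())


def build_performance_data(tasks, im_generate):
    """Pair each task instruction x with the IM's metric value y on that task.

    Assumption: y is the mean Exact Match over the task's (input, reference) examples.
    """
    data = []
    for task in tasks:
        scores = [
            exact_match(im_generate(task["instruction"], task_input), reference)
            for task_input, reference in task["examples"]
        ]
        data.append((task["instruction"], mean(scores)))
    return data


def split(data, seed=0, fractions=(0.8, 0.1)):
    """Shuffle and split into train/validation/test (80/10/10 is an assumed ratio)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_train = int(fractions[0] * len(shuffled))
    n_val = int(fractions[1] * len(shuffled))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


class MeanPredictor:
    """Placeholder PP: ignores the instruction and predicts the mean training y.

    A real PP would be a pretrained LM fine-tuned as a regressor on (x, y) pairs.
    """

    def fit(self, train):
        self.mean_ = mean(y for _, y in train)
        return self

    def predict(self, instruction: str) -> float:
        return self.mean_


def rmse(pp, examples) -> float:
    """Root mean squared error of the PP's predictions (an assumed evaluation choice)."""
    return mean((pp.predict(x) - y) ** 2 for x, y in examples) ** 0.5


def toy_im(instruction: str, task_input: str) -> str:
    """Hypothetical stand-in for an IM: succeeds only on short instructions."""
    return task_input if len(instruction.split()) <= 3 else ""


# Toy test tasks: half are easy for toy_im, half are not.
tasks = [
    {
        "instruction": "Copy the input" if i % 2 == 0
        else "Translate the input into formal French",
        "examples": [("a", "a"), ("b", "b")],
    }
    for i in range(10)
]

performance_data = build_performance_data(tasks, toy_im)
train, val, test = split(performance_data)
pp = MeanPredictor().fit(train)
test_rmse = rmse(pp, test)
```

A baseline of this kind is useful mainly as a reference point: a learned predictor that reads the instruction should achieve a lower test error than one that always predicts the training mean.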