
Evaluation of human-model prediction difference on the Internet Scale of Data

Weitang Liu, Ying Wai Li, Yuelei Li, Zihan Wang, Yi-Zhuang You, Jingbo Shang

TL;DR

OmniInput is a novel approach to evaluating and comparing NNs by the precision and recall (PR) over an input space. Its PR estimate reflects the difference between human annotation and model prediction across that space, which is usually too large to enumerate.

Abstract

Evaluating models on datasets often fails to capture their behavior when faced with unexpected and diverse types of inputs. It would be beneficial if we could evaluate the difference between human annotation and model prediction for an internet-scale number of inputs or, more generally, for an input space whose enumeration is computationally impractical. Traditional model evaluation methods rely on precision and recall (PR) as metrics, which are typically estimated by comparing human annotations with model predictions on a specific dataset. This is feasible because enumerating thousands of test inputs is manageable. However, estimating PR across a large input space is challenging because enumeration becomes computationally infeasible. We propose OmniInput, a novel approach to evaluating and comparing NNs by the PR over an input space. OmniInput is distinct from previous work in that its estimated PR reflects the difference between human annotation and model prediction across the input space, which is usually too large to enumerate. We empirically validate our method within an enumerable input space, and our experiments demonstrate that OmniInput can effectively estimate and compare precision and recall for (large) language models within a broad input space that is not enumerable.

Paper Structure (19 sections, 8 equations, 6 figures, 1 table)

This paper contains 19 sections, 8 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: An overview of our novel OmniInput framework. (a) Use an efficient sampler to obtain the output distribution $\rho(z)$ and the sampled inputs; (b) Annotate the sampled inputs; (c) Estimate the precision and recall at different thresholds $\lambda$ that distinguish different classes. $r(z)$ denotes the precision of the model within the bin of output value $z$; (d) Construct a precision-recall curve.
  • Figure 2: Toy example where enumeration is affordable. The bar plots compare the ground truth and the sampled precision per bin $r(z)$. The line plots compare the ground truth and the sampled output distributions $\rho(z)$.
  • Figure 3: PR for Language models.
  • Figure 4: Sampled inputs of SST2 with sentence length 66.
  • Figure 5: Sampled inputs of SST2 with sentence length 10.
  • ...and 1 more figures
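
Steps (c) and (d) of the Figure 1 caption can be illustrated with a minimal sketch. It assumes, as one plausible reading of the caption rather than the authors' stated formula, that the precision at threshold $\lambda$ is the $\rho$-weighted average of $r(z)$ over bins with $z \ge \lambda$, and that recall is the share of the total $\rho(z)\,r(z)$ mass lying in those bins. The function name `pr_curve` and the toy values are hypothetical.

```python
def pr_curve(rho, r):
    """Sketch of a precision-recall sweep from binned estimates.

    rho[i]: estimated share of the input space whose model output falls in
            bin i (bins ordered by increasing output value z).
    r[i]:   fraction of annotated samples in bin i that humans label positive,
            i.e. the per-bin precision r(z).
    Returns a list of (threshold_bin, precision, recall), sweeping the
    threshold from the highest bin down to the lowest.
    """
    if len(rho) != len(r):
        raise ValueError("rho and r must have the same number of bins")

    # Assumed mass of human-positive inputs in each bin: rho(z) * r(z).
    positive_mass = [p * q for p, q in zip(rho, r)]
    total_positive = sum(positive_mass)

    curve = []
    true_positive = 0.0   # positive mass at or above the threshold
    predicted = 0.0       # total mass at or above the threshold
    for i in range(len(rho) - 1, -1, -1):
        true_positive += positive_mass[i]
        predicted += rho[i]
        precision = true_positive / predicted if predicted > 0 else 0.0
        recall = true_positive / total_positive if total_positive > 0 else 0.0
        curve.append((i, precision, recall))
    return curve


# Toy, made-up values for three output bins (not taken from the paper).
toy_rho = [0.5, 0.25, 0.25]
toy_r = [0.0, 0.5, 1.0]
toy_curve = pr_curve(toy_rho, toy_r)
```

Because only $\rho(z)$ and $r(z)$ enter the computation, the sweep never enumerates the input space; the sampler and the human annotation of sampled inputs supply both quantities per bin.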