Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models

Paulius Rauba, Nabeel Seedat, Max Ruiz Luyten, Mihaela van der Schaar

TL;DR

This paper instantiates the first context-aware testing (CAT) system, SMART Testing, which uses large language models to hypothesize relevant and likely failures and evaluates them on data with a self-falsification mechanism. It shows that SMART automatically identifies more relevant and impactful failures than alternatives.

Abstract

The predominant de facto paradigm of testing ML models relies on either using only held-out data to compute aggregate evaluation metrics or assessing performance on different subgroups. However, such data-only testing methods operate under the restrictive assumption that the available empirical data is the sole input for testing ML models, disregarding valuable contextual information that could guide model testing. In this paper, we challenge the go-to approach of data-only testing and introduce context-aware testing (CAT), which uses context as an inductive bias to guide the search for meaningful model failures. We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures, which are evaluated on data using a self-falsification mechanism. Through empirical evaluations in diverse settings, we show that SMART automatically identifies more relevant and impactful failures than alternatives, demonstrating the potential of CAT as a testing paradigm.

Paper Structure

This paper contains 52 sections, 5 equations, 10 figures, 22 tables, 3 algorithms.

Figures (10)

  • Figure 1: Overview of the SMART Testing Framework showing the four steps. All steps are automatically executed by an LLM.
  • Figure 2: SMART uses an LLM to integrate context $\mathcal{C}$ and data context $\mathcal{D}_{c}$. Relevant and likely failure hypotheses are then generated by the LLM (i.e., the sampling mechanism). The hypotheses are then operationalized in $\mathcal{D}$ and evaluated. In contrast, data-only methods are not guided by context and requirements, searching more exhaustively in $\mathcal{D}$ for divergent slices.
  • Figure 3: A self-falsification module within the SMART framework. A hypothesis generator $l$ generates plausible hypotheses and justifications for when the model might fail. This is operationalized with $\phi_i$ and tested against the empirical data.
  • Figure 4: Contextual awareness in SMART reduces false positives (FPs), i.e., the proportion of irrelevant features in slices. Unlike data-only methods, SMART is not sensitive to the number of irrelevant features. $\downarrow$ is better.
  • Figure 5: Contrasting paradigms of ML model testing along (i) testing relevance, which accounts for requirements and/or context, and (ii) degree of automation and adaptability when carrying out testing. We desire a new paradigm of ML testing to address both.
  • ...and 5 more figures
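
The captions of Figures 2 and 3 describe hypotheses being operationalized as $\phi_i$ and tested against empirical data. The following is a minimal sketch of what such a falsification step could look like, not the paper's implementation. Everything in it is an assumption made for illustration: the `Hypothesis` class, the `falsify` function, the toy data, the hand-written hypotheses (standing in for LLM output), and the choice of a one-sided two-proportion z-test as the statistical check.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


@dataclass
class Hypothesis:
    """A failure hypothesis plus its operationalization (hypothetical structure)."""
    description: str
    phi: Callable[[Row], bool]  # returns True when a row belongs to the hypothesized slice


def falsify(h: Hypothesis, rows: Sequence[Row], y_true: Sequence[int],
            y_pred: Sequence[int], alpha: float = 0.05) -> Dict[str, object]:
    """Check whether accuracy inside the slice is significantly lower than outside.

    Assumption: a one-sided two-proportion z-test; the source does not say
    which test SMART uses.
    """
    in_correct = in_total = out_correct = out_total = 0
    for row, yt, yp in zip(rows, y_true, y_pred):
        hit = int(yt == yp)
        if h.phi(row):
            in_total += 1
            in_correct += hit
        else:
            out_total += 1
            out_correct += hit
    if in_total == 0 or out_total == 0:
        return {"hypothesis": h.description, "supported": False, "p_value": 1.0}

    acc_in = in_correct / in_total
    acc_out = out_correct / out_total
    pooled = (in_correct + out_correct) / (in_total + out_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / in_total + 1 / out_total))
    z = (acc_out - acc_in) / se if se > 0 else 0.0
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal
    return {"hypothesis": h.description, "acc_in": acc_in, "acc_out": acc_out,
            "p_value": p_value, "supported": p_value < alpha}


# Toy data: the "model" is mostly wrong for older patients (first 20 rows).
rows: List[Row] = [{"age": 70.0 if i < 20 else 30.0, "parity": float(i % 2)}
                   for i in range(40)]
y_true = [1] * 40
y_pred = [(1 if i % 4 == 0 else 0) if i < 20 else (0 if i == 20 else 1)
          for i in range(40)]

# Hand-written stand-ins for hypotheses an LLM might propose.
hypotheses = [
    Hypothesis("Model underperforms for patients older than 65",
               lambda r: r["age"] > 65),
    Hypothesis("Model underperforms on even-indexed records",
               lambda r: r["parity"] == 0.0),
]

results = [falsify(h, rows, y_true, y_pred) for h in hypotheses]
for r in results:
    print(r["hypothesis"], "->", "supported" if r["supported"] else "falsified")
```

In this toy setup the age hypothesis survives the test while the parity hypothesis is rejected, which mirrors the idea that only hypotheses consistent with the data are kept. A real system would likely also need to correct for multiple comparisons when many hypotheses are tested.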

Theorems & Definitions (1)

  • Definition 1: Context-Aware Testing