Automated Trustworthiness Testing for Machine Learning Classifiers

Steven Cho; Seaton Cousins-Baxter; Stefano Ruberto; Valerio Terragni

TL;DR

The paper introduces TOWER, an automated trustworthiness oracle that judges text classifier predictions by evaluating the plausibility of explanations through a word-embedding ensemble. It formalizes trustworthiness in ML and leverages LIME explanations to generate words linked to model decisions, then assesses semantic relatedness to the predicted class. Through unsupervised tuning on noisy data and human-ground-truth evaluation, the approach demonstrates that explanations relating to the class can reflect trustworthiness in synthetic setups but reveals gaps when compared to human judgments, indicating the need for more robust class representations and larger datasets. Overall, TOWER marks a promising direction for automated trustworthiness testing, with clear avenues for refinement and broader applicability.

Abstract

Machine Learning (ML) has become an integral part of our society and is commonly used in critical domains such as finance, healthcare, and transportation. It is therefore crucial to evaluate not only whether ML models make correct predictions but also whether they do so for the correct reasons, so that we can trust them to perform well on unseen data. This concept is known as trustworthiness in ML. Recently, explainability techniques (e.g., LIME, SHAP) have been developed to interpret the decision-making processes of ML models by providing explanations for their predictions (e.g., the words in the input that influenced the prediction the most). Assessing the plausibility of these explanations can increase our confidence in the models' trustworthiness. However, current approaches typically rely on human judgment to determine whether an explanation is plausible. This paper proposes TOWER, the first technique to automatically create trustworthiness oracles that determine whether text classifier predictions are trustworthy. TOWER is model-agnostic: it leverages word embeddings to automatically evaluate the trustworthiness of text classifiers based on the outputs of explainability techniques. Our hypothesis is that a prediction is trustworthy if the words in its explanation are semantically related to the predicted class. We perform unsupervised learning with untrustworthy models obtained from noisy data to find the optimal configuration of TOWER. We then evaluate TOWER on a human-labeled trustworthiness dataset that we created. The results show that TOWER can detect a decrease in trustworthiness as noise increases, but that it is not effective when evaluated against the human-labeled dataset. Our initial experiments suggest that our hypothesis is valid and promising, but further research is needed to better understand the relationship between explanations and trustworthiness issues.
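The hypothesis in the abstract can be illustrated with a minimal sketch. This is not TOWER's implementation: the embedding table, the `judge` function, the threshold of 0.5, and the averaging rule are all assumptions made for illustration. A single toy embedding table stands in for the word-embedding ensemble, and the explanation words are supplied by hand rather than produced by LIME.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for this example only.
# A real setup would load pretrained word vectors instead.
TOY_EMBEDDINGS = {
    "hockey": [0.9, 0.1, 0.0],
    "goalie": [0.8, 0.2, 0.1],
    "puck": [0.85, 0.15, 0.05],
    "posting": [0.0, 0.1, 0.9],
    "host": [0.1, 0.0, 0.95],
    "nntp": [0.05, 0.05, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def judge(explanation_words, class_word, embeddings, threshold=0.5):
    """Return 'trustworthy', 'untrustworthy', or 'undefined'.

    Assumed rule: average the similarity between each known explanation
    word and the class word, then compare the mean with a threshold.
    """
    if class_word not in embeddings:
        return "undefined"  # no vector for the class: cannot decide
    known = [w for w in explanation_words if w in embeddings]
    if not known:
        return "undefined"  # no explanation word has a vector
    class_vec = embeddings[class_word]
    mean_sim = sum(cosine(embeddings[w], class_vec) for w in known) / len(known)
    return "trustworthy" if mean_sim >= threshold else "untrustworthy"


if __name__ == "__main__":
    print(judge(["goalie", "puck"], "hockey", TOY_EMBEDDINGS))
    print(judge(["posting", "host", "nntp"], "hockey", TOY_EMBEDDINGS))
    print(judge(["unknownword"], "hockey", TOY_EMBEDDINGS))
```

In this toy setting, explanation words close to the class word yield a trustworthy verdict, header-like words such as "posting" and "host" yield an untrustworthy one, and words without a vector leave the verdict undefined.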

Paper Structure (15 sections, 4 figures, 3 tables)

Introduction
Preliminaries and Problem Formulation
TOWER
Input
Explanations
Word Embeddings Ensemble
Trustworthiness Oracle Outcome
Evaluation
Implementation
RQ1 - Unsupervised training on noisy models
RQ2 - Human-based evaluation
Discussion
Threats to Validity
Related Work
Conclusion

Figures (4)

Figure 1: Example of an "untrustworthy" prediction from the LIME paper. Algorithm 2 based the prediction on irrelevant words: “Posting”, “Host”, “Re”, and “Nntp”.
Figure 2: Logical architecture of TOWER. Given an input instance $x$, an ML classifier model under test $M$, and the predicted class $c$, TOWER automatically judges whether the prediction $M(x) = c$ is trustworthy, untrustworthy, or undefined (not enough information or confidence to make a decision).
Figure 3: Example charts of trustworthiness against noise level for each model on the 20 Newsgroups dataset, using the optimal configuration and the $t/(t+u)$ slope calculation. The adjusted slope is used for all noise types except label noise.
Figure 4: Confusion matrix. Labels 0, 1, and 2 are trustworthy, untrustworthy, and undefined, respectively.
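
The $t/(t+u)$ slope mentioned in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: it assumes $t$ and $u$ are the counts of trustworthy and untrustworthy verdicts (undefined verdicts excluded), and it fits an ordinary least-squares line through the ratio at each noise level. The "adjusted slope" from the caption is not defined on this page and is not reproduced here. The function names and the sample values are hypothetical.

```python
def trust_ratio(verdicts):
    """Fraction t / (t + u) of trustworthy verdicts among decided ones.

    Assumption: 'undefined' verdicts are ignored. Returns None when no
    verdict is decided, because the ratio is then not defined.
    """
    t = verdicts.count("trustworthy")
    u = verdicts.count("untrustworthy")
    return t / (t + u) if (t + u) else None


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical noise levels and trust ratios, for illustration only.
    noise_levels = [0.0, 0.5, 1.0]
    ratios = [0.8, 0.6, 0.4]
    print(least_squares_slope(noise_levels, ratios))
```

A negative slope in such a fit would correspond to trustworthiness decreasing as noise increases, which is the trend the abstract reports for the noisy models.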

Theorems & Definitions (2)

Definition 1
Definition 2
