Evaluating Model Bias Requires Characterizing its Mistakes

Isabela Albuquerque; Jessica Schrouff; David Warde-Farley; Taylan Cemgil; Sven Gowal; Olivia Wiles

TL;DR

SkewSize is a principled and flexible metric, inspired by the hypothesis testing framework, that captures bias from the mistakes in a model's predictions. It applies to multi-class settings and generalises to the open-vocabulary setting of generative models.

Abstract

The ability to properly benchmark model performance in the face of spurious correlations is important both for building better predictors and for increasing confidence that models are operating as intended. We demonstrate that characterizing (as opposed to simply quantifying) model mistakes across subgroups is pivotal to properly reflect model biases, which are ignored by standard metrics such as worst-group accuracy or accuracy gap. Inspired by the hypothesis testing framework, we introduce SkewSize, a principled and flexible metric that captures bias from mistakes in a model's predictions. It can be used in multi-class settings or generalised to the open-vocabulary setting of generative models. SkewSize is an aggregation of the effect size of the interaction between two categorical variables: the spurious variable representing the bias attribute and the model's prediction. We demonstrate the utility of SkewSize in multiple settings, including standard vision models trained on synthetic data, vision models trained on ImageNet, and large-scale vision-and-language models from the BLIP-2 family. In each case, SkewSize highlights biases not captured by other metrics, while also providing insights into the impact of recently proposed techniques, such as instruction tuning.
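To make "an aggregation of the effect size of the interaction between two categorical variables" concrete, here is a minimal sketch. The abstract does not say which effect-size statistic or which aggregation is used, so this sketch assumes Cramér's V (derived from a chi-squared statistic) as the per-class effect size and the skewness of the per-class values as the aggregate. The function names `cramers_v` and `skewness` and all counts are illustrative, not taken from the paper.

```python
import math

def cramers_v(table):
    """Effect size of the association between subgroup (rows) and
    predicted class (columns). ASSUMPTION: Cramér's V is used here;
    this section does not name the statistic."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    # Ignore subgroups or predicted classes that never occur.
    rows = [i for i, t in enumerate(row_tot) if t > 0]
    cols = [j for j, t in enumerate(col_tot) if t > 0]
    if len(rows) < 2 or len(cols) < 2:
        return 0.0
    chi2 = 0.0
    for i in rows:
        for j in cols:
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    dof = min(len(rows), len(cols)) - 1
    return math.sqrt(chi2 / (n * dof))

def skewness(values):
    """Moment-based skewness m3 / m2**1.5 of a list of numbers.
    ASSUMPTION: skewness is the aggregation across classes."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

# Invented counts for one ground-truth class ("Doctor").
# Rows: two subgroups; columns: predicted Doctor, Biologist, Nurse.
same_errors = [[90, 5, 5], [90, 5, 5]]      # errors match across subgroups
skewed_errors = [[90, 10, 0], [90, 0, 10]]  # errors depend on subgroup

print(cramers_v(same_errors))    # 0.0
print(cramers_v(skewed_errors))  # sqrt(0.1), about 0.316

# One effect size per ground-truth class, then a single aggregate.
per_class = [cramers_v(t) for t in (same_errors, same_errors, skewed_errors)]
print(skewness(per_class))
```

Both toy tables have 90% accuracy in each subgroup, yet only the second yields a non-zero effect size, because the effect size looks at how predictions are distributed rather than at how many are correct.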

Paper Structure (27 sections, 6 equations, 6 figures, 7 tables, 1 algorithm)

Introduction
Method
Background
Estimating distributional bias for categorical distributions
Aggregating the Effect Size
Experiments
Controlled setting: dSprites dataset
Estimating distributional bias in multi-class classification: ImageNet
Comparing VLMs for multi-class classification across model size
SkewSize across varying models
Related Work
Discussion
Fairness metrics definitions
Computing Effect Size using Other Statistics
Estimating distributional bias in multi-class classification: DomainNet
...and 12 more sections

Figures (6)

Figure 1: Standard metrics fail to capture biases within a model. We plot the prediction counts for two models given three ground-truth classes (Writer, Doctor, Biologist). Model 1 (M1) displays similar distributions of errors for both subgroups, whereas Model 2 (M2) displays "stereotypical" errors (e.g. mispredicting female Doctors as Nurses). In the table, we report accuracy (Acc), worst group accuracy (WG), Gap and their difference ($\Delta$) between M1 and M2. Only our approach (SkewSize) captures the bias in all settings.
Figure 2: Comparing models trained on ImageNet across multiple metrics. We plot SkewSize versus each accuracy-based metric for a variety of models. The results highlight that no accuracy-based metric presents a clear trend with respect to SkewSize, demonstrating that it captures aspects of performance not exposed by these other metrics. Moreover, models with similar performance according to accuracy-based metrics, such as both BiT-S models, can be discriminated by SkewSize.
Figure 3: Bias exposed by SkewSize. Both domains for the socks class have similar accuracy, but a mismatch in errors indicates the model relies on spurious features of background/color.
Figure 4: Comparing effect size across classes - BLIP-2. Effect size values are split into bands: 0-0.1 is a negligible effect, while 0.1-0.3, 0.3-0.5, and above 0.5 correspond to small, medium, and large, respectively. Scaling up model size with an unsupervised language model increased the number of classes with a large effect size, whereas instruction-tuning decreased it.
Figure 5: DomainNet. Per-class accuracy vs. effect size. Hue indicates EO. Points in the top-rightmost corner of the plot indicate that, even for classes where the model is most accurate, systematic differences in predictions across subgroups might exist.
...and 1 more figure
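The point of the Figure 1 caption, that accuracy-based metrics can be identical for two models whose mistakes differ, can be checked with a small sketch. The data below is invented for illustration and is not the data behind the figure. Gap is taken here to mean overall accuracy minus worst-group accuracy, which is an assumption since this page does not define it; the helper names `accuracy_metrics`, `error_counts`, and `make_records` are hypothetical.

```python
def accuracy_metrics(records):
    """records: list of (subgroup, true_label, predicted_label) tuples."""
    correct, total = {}, {}
    for group, truth, pred in records:
        total[group] = total.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + (truth == pred)
    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / sum(total.values())
    worst = min(per_group.values())
    # ASSUMPTION: gap = overall accuracy minus worst-group accuracy.
    return {"acc": overall, "worst_group": worst, "gap": overall - worst}

def error_counts(records):
    """Count mistakes by (subgroup, predicted_label)."""
    counts = {}
    for group, truth, pred in records:
        if truth != pred:
            counts[(group, pred)] = counts.get((group, pred), 0) + 1
    return counts

def make_records(group, truth, predicted_counts):
    """Expand {predicted_label: count} into individual records."""
    return [(group, truth, pred)
            for pred, count in predicted_counts.items()
            for _ in range(count)]

# Invented predictions for ground-truth class "Doctor".
m1 = (make_records("male", "Doctor", {"Doctor": 90, "Nurse": 5, "Biologist": 5})
      + make_records("female", "Doctor", {"Doctor": 90, "Nurse": 5, "Biologist": 5}))
m2 = (make_records("male", "Doctor", {"Doctor": 90, "Biologist": 10})
      + make_records("female", "Doctor", {"Doctor": 90, "Nurse": 10}))

print(accuracy_metrics(m1) == accuracy_metrics(m2))  # True
print(error_counts(m1))
print(error_counts(m2))
```

In this toy setup the three accuracy-based numbers cannot tell the two models apart. Only the breakdown of errors by subgroup shows that the second model's mistakes depend on the subgroup.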
