Foundations for Unfairness in Anomaly Detection -- Case Studies in Facial Imaging Data

Michael Livanos; Ian Davidson

TL;DR

The paper examines fairness in deep anomaly detection (AD) applied to facial imaging by asking who is disadvantaged and why. It compares autoencoder-based and one-class AD methods on CelebA and LFW, and introduces the Disparate Impact Ratio (DIR) together with four foundational factors that may drive unfair outcomes: Incompressibility, Sample Size Bias (SSB), Spurious Feature Variance (SFV), and Label Attribution Noise (LAN). Through experiments and hypothesis testing, it shows that most groups are treated largely fairly, but that fairness can break down through the interaction between data and algorithm, and that a four-property model is needed to explain the observed unfairness. The findings provide a framework for understanding and mitigating unsupervised unfairness in facial AD, and point to future directions such as dataset-aware remediation, threshold adjustment, and group-specific detectors.
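To make the idea of a disparate impact ratio concrete, here is a minimal sketch. The summary above does not give the paper's exact formula for DIR, so this assumes the common definition: the rate at which one group is flagged as anomalous divided by the rate for a reference group. The function name `anomaly_rate_ratio` and the toy data are illustrative, not taken from the paper.

```python
from collections import Counter

def anomaly_rate_ratio(flags, groups, group, reference):
    """Ratio of anomaly-flag rates: P(flag | group) / P(flag | reference).

    Assumed definition for illustration; the paper's DIR may differ.
    """
    totals = Counter(groups)
    flagged = Counter(g for f, g in zip(flags, groups) if f)
    if totals[group] == 0 or totals[reference] == 0:
        raise ValueError("both groups must be present")
    ref_rate = flagged[reference] / totals[reference]
    if ref_rate == 0:
        raise ValueError("reference group has no flagged samples")
    return (flagged[group] / totals[group]) / ref_rate

# Toy data: 1 = flagged as an anomaly, 0 = treated as normal.
flags = [1, 0, 0, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
ratio = anomaly_rate_ratio(flags, groups, "b", "a")
```

Under this assumed definition, a ratio near 1 means both groups are flagged at similar rates, while a ratio well above 1 means the first group is flagged disproportionately often.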

Abstract

Deep anomaly detection (AD) is perhaps the most controversial of data analytic tasks, as it identifies entities that are then specifically targeted for further investigation or exclusion. Also controversial is the application of AI to facial imaging data. This work explores the intersection of these two areas to answer two core questions: "who" these algorithms are unfair to and, equally important, "why". Recent work has shown that deep AD can be unfair to different groups despite being unsupervised; one recent study showed that, for portraits of people, men of color are far more likely to be chosen as outliers. We study the two main categories of AD algorithms, autoencoder-based and single-class-based, which effectively try to compress all the instances, with those that cannot be easily compressed deemed to be outliers. We experimentally verify sources of unfairness such as the under-representation of a group (e.g. people of color are relatively rare), spurious group features (e.g. men are often photographed with hats), and group labeling noise (e.g. race is subjective). We conjectured that lack of compressibility is the main foundation and that the others cause it, but experimental results show otherwise, and we present a natural hierarchy amongst them.
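The compression view of AD described in the abstract can be sketched in a few lines. This is not the paper's model: it swaps the deep autoencoder for a linear, one-dimensional bottleneck (the leading principal direction, found by power iteration) so that the example runs with the standard library alone. The function names and toy data are assumptions made for illustration.

```python
import math

def top_direction(rows, iters=100):
    """Leading principal direction of mean-centred rows, by power iteration."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # One multiplication by the unnormalised covariance: w = X^T (X v).
        proj = [sum(x * y for x, y in zip(r, v)) for r in rows]
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def reconstruction_errors(rows):
    """Squared error after compressing each row to one number and decoding it."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centred = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    v = top_direction(centred)
    errors = []
    for r in centred:
        code = sum(x * y for x, y in zip(r, v))   # encode: 1-D bottleneck
        decoded = [code * x for x in v]           # decode back to input space
        errors.append(sum((a - b) ** 2 for a, b in zip(r, decoded)))
    return errors

# Five points on a line plus one point off it (hypothetical toy data).
data = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [2.0, 5.0]]
errors = reconstruction_errors(data)
most_anomalous = max(range(len(data)), key=errors.__getitem__)
```

The point that fits the dominant pattern least well reconstructs worst and would be the first to be flagged. The same mechanism suggests why a group whose images are harder to compress could be flagged more often, which is the incompressibility conjecture the abstract describes.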

Paper Structure (15 sections, 10 equations, 9 figures, 6 tables)


Introduction
Background and Related Work
Four Reasons for Unfairness And Their Measurement
Incompressibility of Data
Causes Beyond Incompressibility
Measurements of Unfairness and Four Properties
Experimental Results - Who Is AD Unfair To?
Experimental Results - Why Is AD Unfair?
Relationship between Unfairness and Each Property
Relationship between Multiple Properties
Hypothesis Testing of Relationship Claims
A Proposed Model Of Unsupervised Unfairness Relationships
Conclusion, Limitations, and Future Work
Models
Raw Data Results

Figures (9)

Figure 1: Example of AD Being Unfair When Applied to Facial Imaging Data. Reproduced from zhang2021towards.
Figure 2: A Diagrammatic view of the expected reasons behind biased outlier detection.
Figure 3: A frequency distribution of the Anomaly DIR score versus how often it occurs across all algorithms and datasets.
Figure 4: A frequency distribution of the Anomaly DIR score by algorithm. We see that the AE with a more flexible definition of normality is more fair.
Figure 5: (Figure continues on next page)
...and 4 more figures
