Contexts Matter: An Empirical Study on Contextual Influence in Fairness Testing for Deep Learning Systems

Chengwen Du, Tao Chen

TL;DR

This paper tackles the problem of context sensitivity in fairness testing for deep learning systems by conducting a large-scale empirical study across $12$ datasets, $3$ test generators, $10$ context settings per type, and $10{,}800$ investigated cases. It systematically evaluates how hyperparameters, selection bias, and label bias affect both test adequacy and fairness metrics, using a mix of five test adequacy metrics and the IDI fairness metric. Its key findings show that non-optimized hyperparameters generally degrade adequacy and fairness discovery, while data bias can boost generator performance; context settings significantly alter results and generator rankings, with landscape structure (ruggedness and guidance) explaining much of this behavior. The study provides actionable insights and methodological guidance for evaluating fairness test generators under realistic, diverse contexts, and it releases code and materials to support reproducibility and further research in context-aware fairness testing.
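The summary names the IDI fairness metric without defining it. As a rough illustration only, the sketch below assumes the definition common in the fairness testing literature: an individual discriminatory instance (IDI) is an input whose prediction changes when only its protected attributes are altered, and the metric is the share of such instances among the generated samples. The names `is_idi`, `idi_ratio`, and `toy_model` are hypothetical and do not come from the paper or its released code.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Instance = Tuple[int, ...]


def is_idi(predict: Callable[[Instance], int],
           x: Instance,
           protected_domains: Dict[int, Sequence[int]]) -> bool:
    """Return True if changing only protected attributes flips the prediction.

    protected_domains maps a feature index to the values it may take
    (an assumed encoding, chosen here for illustration).
    """
    base = predict(x)
    indices = list(protected_domains)
    # Enumerate every combination of protected-attribute values.
    for values in product(*(protected_domains[i] for i in indices)):
        variant = list(x)
        for i, v in zip(indices, values):
            variant[i] = v
        candidate = tuple(variant)
        if candidate != x and predict(candidate) != base:
            return True
    return False


def idi_ratio(predict: Callable[[Instance], int],
              samples: Sequence[Instance],
              protected_domains: Dict[int, Sequence[int]]) -> float:
    """Fraction of samples that are individual discriminatory instances."""
    if not samples:
        return 0.0
    hits = sum(is_idi(predict, x, protected_domains) for x in samples)
    return hits / len(samples)


def toy_model(x: Instance) -> int:
    """Hypothetical classifier that (unfairly) depends on feature 1."""
    return int(x[0] + 2 * x[1] >= 3)


if __name__ == "__main__":
    generated = [(3, 0), (2, 0), (0, 1), (1, 1)]
    # Feature index 1 is treated as a binary protected attribute.
    print(idi_ratio(toy_model, generated, {1: [0, 1]}))
```

In this toy setup a test generator would be judged by how many of its generated samples turn out to be IDIs; the study's actual measurement procedure may differ.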

Abstract

Background: Fairness testing for deep learning systems has become increasingly important. However, much work assumes perfect context and conditions from the other parts: well-tuned hyperparameters for accuracy, rectified bias in the data, and mitigated bias in the labeling. Yet these are often difficult to achieve in practice due to their resource- and labour-intensive nature. Aims: In this paper, we aim to understand how varying contexts affect fairness testing outcomes. Method: We conduct an extensive empirical study, covering $10{,}800$ cases, to investigate how contexts can change the fairness testing result at the model level against the existing assumptions. We also study why the outcomes were observed through the lens of correlation and fitness landscape analysis. Results: Our results show that different context types and settings generally have a significant impact on the testing, which is mainly caused by shifts of the fitness landscape under varying contexts. Conclusions: Our findings provide key insights for practitioners to evaluate the test generators and hint at future research directions.

Paper Structure (33 sections, 2 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Distribution of the fairness categories.
  • Figure 2: Distribution of the context types and settings considered.
  • Figure 3: The commonality of test generators evaluated in fairness testing (a paper might involve multiple generators). The y-axis shows the relevant paper count.
  • Figure 4: The protocol of our empirical study.
  • Figure 5: The number of cases in different ranges of the harmonic mean $p$-value from the 15 cases of generators and metrics.
  • ...and 3 more figures