Connecting Algorithmic Fairness to Quality Dimensions in Machine Learning in Official Statistics and Survey Production

Patrick Oliver Schenk; Christoph Kern

TL;DR

This paper argues that algorithmic fairness should be treated as a dedicated quality dimension within the Quality Framework for Statistical Algorithms (QF4SA) for National Statistical Organizations (NSOs). It maps QF4SA's dimensions (accuracy, timeliness, cost-effectiveness, explainability, and reproducibility, together with robustness) to fairness concepts, highlighting interactions and data-centric considerations. An empirical example on predicting long-term unemployment (LTU) demonstrates how subgroup fairness metrics can inform interpretation and reporting, while the discussion outlines practical implications for interpretability, robustness, and uncertainty in official statistics contexts. The work advances trustworthy ML in NSOs by integrating fairness into quality assessment. It offers methodological guidance for detecting, diagnosing, and mitigating fairness-related issues in data collection, processing, and analysis, and it encourages collaboration and data sharing to improve overall data quality and equity.

Abstract

National Statistical Organizations (NSOs) increasingly draw on Machine Learning (ML) to improve the timeliness and cost-effectiveness of their products. When introducing ML solutions, NSOs must ensure that high standards with respect to robustness, reproducibility, and accuracy are upheld as codified, e.g., in the Quality Framework for Statistical Algorithms (QF4SA; Yung et al. 2022). At the same time, a growing body of research focuses on fairness as a precondition for the safe deployment of ML, to prevent disparate social impacts in practice. However, fairness has not yet been explicitly discussed as a quality aspect in the context of the application of ML at NSOs. We employ the QF4SA of Yung et al. (2022) and present a mapping of its quality dimensions to algorithmic fairness. We thereby extend the QF4SA in several ways: we argue for fairness as its own quality dimension, we investigate the interaction of fairness with other dimensions, and we explicitly address data, both on its own and in its interaction with applied methodology. In parallel with empirical illustrations, we show how our mapping can contribute to methodology in the domains of official statistics, algorithmic fairness, and trustworthy machine learning.

Paper Structure (39 sections, 4 figures, 1 table)

Introduction
Official Statistics, Other Data Producers, and Machine Learning
Fairness
Contribution
Structure
Background: Machine Learning
ML and Statistics, Supervised and Unsupervised Learning
The Machine Learning Mindset, Procedural and Methodological Benefits, and a Comparison to Statistics
Overarching Drivers and Goals of NSOs
Applications and Tasks for ML in NSOs
Before and During Collection of (Traditional) Data
Processing and Adjusting Data
Analysing Data
Outlook
Compatibility
...and 24 more sections

Figures (4)

Figure 1: Surrogate model explanations of a random forest predicting long-term unemployment, computed by protected group membership.
Figure 2: (Change in) prediction performance and selected fairness metrics for random forest models over time. For each year, a new random forest is trained and evaluated with data from the next year. Parity difference scores show the difference in predicted LTU rates between non-German and German job seekers. FNR difference scores show the difference in false negative rates between non-Germans and Germans.
Figure 3: Jaccard similarities between LTU predictions of random forest models with different hyper-parameter settings (RF 1: ntree = 750, nodesize = 1, RF 2: ntree = 250, nodesize = 1, RF 3: ntree = 500, nodesize = 5, RF 4: ntree = 500, nodesize = 15), computed by protected group membership.
Figure 4: Subgroup prediction performance (balanced accuracy) of a random forest predicting long-term unemployment. Group coding scheme: Citizenship (0: non-German, 1: German) - Gender (0: Male, 1: Female) - Age group (1: 18-30, 2: 31-50, 3: >50).
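
The group-wise quantities named in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. This is not the authors' code. The sign convention (non-German minus German), the definition of Jaccard similarity over the sets of cases predicted as LTU, the function names, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def positive_rate(y_pred: Sequence[int]) -> float:
    """Share of cases predicted positive (here: predicted LTU)."""
    return sum(y_pred) / len(y_pred)


def false_negative_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of truly positive cases that the model predicts as negative."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(1 for p in preds_for_positives if p == 0) / len(preds_for_positives)


def group_differences(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    group: Sequence[str],
    focal: str,
    reference: str,
) -> Dict[str, float]:
    """Parity and FNR differences, computed as focal minus reference (assumed sign)."""

    def subset(label: str):
        idx: List[int] = [i for i, g in enumerate(group) if g == label]
        return [y_true[i] for i in idx], [y_pred[i] for i in idx]

    true_f, pred_f = subset(focal)
    true_r, pred_r = subset(reference)
    return {
        "parity_difference": positive_rate(pred_f) - positive_rate(pred_r),
        "fnr_difference": false_negative_rate(true_f, pred_f)
        - false_negative_rate(true_r, pred_r),
    }


def jaccard_similarity(pred_a: Sequence[int], pred_b: Sequence[int]) -> float:
    """Overlap between the sets of cases that two models flag as positive."""
    flagged_a = {i for i, p in enumerate(pred_a) if p == 1}
    flagged_b = {i for i, p in enumerate(pred_b) if p == 1}
    union = flagged_a | flagged_b
    # Two models that flag nobody are treated as identical (assumed convention).
    return len(flagged_a & flagged_b) / len(union) if union else 1.0


# Toy data, invented for illustration only (1 = long-term unemployed).
group = ["non-German"] * 4 + ["German"] * 4
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 1, 1, 1, 0, 0]

diffs = group_differences(y_true, y_pred, group, "non-German", "German")
similarity = jaccard_similarity([1, 1, 0, 0], [1, 0, 1, 0])
```

Computing the Jaccard similarity separately within each protected group, as in Figure 3, would amount to calling `jaccard_similarity` on the predictions restricted to that group.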
