Distribution-Free Statistical Dispersion Control for Societal Applications

Zhun Deng; Thomas P. Zollo; Jake C. Snell; Toniann Pitassi; Richard Zemel

TL;DR

This work tackles the challenge of providing distribution-free guarantees for dispersion of predictive losses across a population, not just expected loss. It develops a two-step framework: first obtain two-sided confidence bounds on the loss CDF $F$ from validation data, then propagate these bounds to nonlinear and group-based dispersion functionals such as the Gini coefficient, Atkinson index, and CVaR-based fairness metrics. A novel optimization procedure is introduced to tighten bounds in data-scarce regimes, including a neural-parameterized approach for selecting bound thresholds and a post-processing step to guarantee the distribution-free constraints. The authors validate the framework on toxic-comment detection, medical-imaging, and recommendation tasks, showing improved fairness-oriented model selection and tighter, more reliable bounds on societal dispersion measures. This advances responsible ML by equipping practitioners with robust, interpretable guarantees for distribution-wide equity metrics in high-stakes applications.
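As a rough illustration of the first step, the sketch below builds a two-sided confidence band on the loss CDF from validation losses. It uses the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality as a simple stand-in, not the paper's own bound construction. The function name `dkw_cdf_band` and the parameter `delta` are hypothetical.

```python
import numpy as np

def dkw_cdf_band(losses, delta=0.05):
    """Two-sided DKW band on the loss CDF, evaluated at the sorted sample points.

    Assumption: losses are i.i.d. validation losses. With probability at least
    1 - delta, lower <= F <= upper holds simultaneously at every point.
    """
    x = np.sort(np.asarray(losses, dtype=float))
    n = x.size
    # DKW half-width: P(sup |F_n - F| > eps) <= 2 * exp(-2 * n * eps^2)
    eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    # Empirical CDF at each sorted point; side="right" handles tied losses
    ecdf = np.searchsorted(x, x, side="right") / n
    lower = np.clip(ecdf - eps, 0.0, 1.0)
    upper = np.clip(ecdf + eps, 0.0, 1.0)
    return x, lower, upper
```

The second step would then evaluate a dispersion functional on `lower` and `upper` rather than on the empirical CDF itself. The DKW band has constant width, so it is typically loose in the tails compared with bounds built from order statistics.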

Abstract

Explicit finite-sample statistical guarantees on model performance are an important ingredient in responsible machine learning. Previous work has focused mainly on bounding either the expected loss of a predictor or the probability that an individual prediction will incur a loss value in a specified range. However, for many high-stakes applications, it is crucial to understand and control the dispersion of a loss distribution, or the extent to which different members of a population experience unequal effects of algorithmic decisions. We initiate the study of distribution-free control of statistical dispersion measures with societal implications and propose a simple yet flexible framework that allows us to handle a much richer class of statistical functionals beyond previous work. Our methods are verified through experiments in toxic comment detection, medical imaging, and film recommendation.

Paper Structure (65 sections, 7 theorems, 82 equations, 6 figures, 2 tables)

Introduction
Problem setup
Statistical dispersion measures for societal applications
Standard measures of dispersion
Gini family of measures.
Atkinson index.
Group-based measures of dispersion
Absolute/quadratic difference of risks and beyond.
CVaR-fairness risk measure and its extensions.
Uncertainty quantification of risk measures.
Distribution-free control of societal dispersion measures
Methods to obtain confidence two-sided bounds for CDFs
A reduction approach to constructing upper bounds of CDFs
Controlling statistical dispersion measures
Control of nonlinear functions of CDFs
...and 50 more sections

Key Result

proposition 1

For the CDF $F$ of $X$, if there exist two CDFs $F_U, F_L$ such that $F_U \succeq F \succeq F_L$, then we have $F^-_L \succeq F^- \succeq F^-_U$.
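The ordering flip can be checked numerically on a toy example. The sketch below assumes that $\succeq$ denotes pointwise dominance and that $F^-$ is the generalized inverse $F^-(p) = \inf\{x : F(x) \ge p\}$. The step CDFs, the grid `xs`, and the helper `generalized_inverse` are all made up for illustration.

```python
import numpy as np

def generalized_inverse(xs, cdf_vals, p):
    """F^-(p) = inf{x : F(x) >= p} for a step CDF with jumps at xs.

    Assumptions: cdf_vals is nondecreasing and ends at 1.0, the CDF is 0
    to the left of xs[0], and p lies in (0, 1].
    """
    idx = np.searchsorted(cdf_vals, p, side="left")  # first index with cdf_vals >= p
    return float(xs[idx])

# Toy step CDFs on a shared grid, with F_U >= F >= F_L pointwise
xs  = np.array([0.1, 0.2, 0.5, 0.8, 1.0])
F_U = np.array([0.35, 0.55, 0.75, 0.95, 1.0])
F   = np.array([0.20, 0.40, 0.60, 0.80, 1.0])
F_L = np.array([0.05, 0.25, 0.45, 0.65, 1.0])

p = 0.5
q_from_upper = generalized_inverse(xs, F_U, p)  # lower bound on the median
q_true       = generalized_inverse(xs, F, p)
q_from_lower = generalized_inverse(xs, F_L, p)  # upper bound on the median
print(q_from_upper, q_true, q_from_lower)
```

In this toy case the upper CDF bound yields the smaller quantile and the lower CDF bound yields the larger one, which is the direction the proposition states.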

Figures (6)

Figure 1: Example illustrating how two predictors (here $h_1$ and $h_2$) with the same expected loss can induce very different loss dispersion across the population. Left: The loss CDF produced by each predictor is bounded from below and above. Middle: The Lorenz curve is a popular graphical representation of inequality in some quantity across a population, in our case expressing the cumulative share of the loss experienced by the best-off $\beta$ proportion of the population. CDF upper and lower bounds can be used to bound the Lorenz curve (and thus Gini coefficient, a function of the shape of the Lorenz curve). Under $h_2$ the worst-off population members experience most of the loss. Right: Predictors with the same expected loss may induce different median loss for (possibly protected) subgroups in the data, and thus we may wish to bound these differences.
Figure 2: Left: Bounds on the expected loss, scaled Gini coefficient, and total objective across different hypotheses. Right: Lorenz curves induced by choosing a hypothesis based on the expected loss bound versus the bound on the total objective. The y-axis shows the cumulative share of the loss that is incurred by the best-off $\beta$ proportion of the population, where a perfectly fair predictor would produce a distribution along the line $y=x$.
Figure 3: We select two hypotheses $h_0$ and $h_1$ with different bounds on Atkinson index produced using 2000 validation samples, and once again visualize the Lorenz curves induced by each. Tighter control on the Atkinson index leads to a more equal distribution of the loss (especially across the middle of the distribution, which aligns with the choice of $\epsilon$), highlighting the utility of being able to target such a metric in conservative model selection.
Figure 4: Example illustrating the construction of distribution-free CDF lower and upper bounds by bounding order statistics. On the left, order statistics are drawn from a uniform distribution. On the right, samples are drawn from a real loss distribution, and the corresponding Berk-Jones CDF lower and upper bounds are shown in black. Our distribution-free method gives bounds $b^{(l)}_i$ and $b^{(u)}_i$ on each sorted order statistic such that each bound depends only on $i$, as illustrated in the plots for $i=5$ (shown in blue). On the left, 1000 realizations of $x_{(1)}, \ldots, x_{(n)}$ are shown in yellow. On the right, 1000 empirical CDFs are shown in yellow, and the true CDF $F$ is shown in red.
Figure 5: Plot of smoothed median function with $\beta=0.5$ and $a=0.01$
...and 1 more figure
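The Lorenz curve and Gini coefficient mentioned in the captions for Figures 1 and 2 can be computed empirically as follows. This is a minimal sketch using the standard sample definitions (Gini as one minus twice the area under the Lorenz curve). It gives a point estimate, not the distribution-free bound developed in the paper, and the function names are hypothetical.

```python
import numpy as np

def lorenz_curve(losses):
    """Cumulative share of total loss incurred by the best-off beta fraction.

    Assumption: losses are nonnegative with a strictly positive sum.
    """
    x = np.sort(np.asarray(losses, dtype=float))
    cum = np.concatenate([[0.0], np.cumsum(x)])
    beta = np.linspace(0.0, 1.0, x.size + 1)
    return beta, cum / cum[-1]

def gini(losses):
    """Sample Gini coefficient: 1 - 2 * (area under the Lorenz curve)."""
    beta, share = lorenz_curve(losses)
    # Trapezoid rule over the piecewise-linear Lorenz curve
    area = np.sum((share[1:] + share[:-1]) / 2.0 * np.diff(beta))
    return 1.0 - 2.0 * area
```

When every individual incurs the same loss, the Lorenz curve lies on the line $y=x$ and the coefficient is 0. When a single individual among $n$ incurs all of the loss, this estimator gives $(n-1)/n$.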

Theorems & Definitions (19)

definition 1: Quantile-based Risk Measure
definition 2: Gini coefficient
definition 3: Atkinson index
remark 1
proposition 1
lemma 1
theorem 1
proposition 2: Restatement of Proposition 1
proof
Claim 2
...and 9 more
