Perturbation Effects on Accuracy and Fairness among Similar Individuals

Xuran Li, Hao Xue, Peng Wu, Xingjun Ma, Zhen Zhang, Huaming Chen, Flora D. Salim

TL;DR

This work addresses the vulnerability of deep models to adversarial perturbations in high-stakes settings by introducing Robust Individual Fairness (RIF), a criterion that requires predictions for similar individuals to remain both accurate and fair under perturbations. It presents RIFair, a gradient-based attack that applies identical perturbations to similar individuals to reveal violations of RIF, and introduces the Perturbation Impact Index (PII) and Perturbation Impact Direction (PID) to quantify and explain why identical perturbations yield divergent outcomes. The study shows that existing robustness and fairness metrics capture distinct failure modes and that many instances are jointly vulnerable to multiple adversarial outcomes, underscoring the need for joint evaluation frameworks. Empirically, adversarial examples generated by RIFair can strategically manipulate test-set metrics, and identical perturbations can cause asynchronous prediction changes because PII varies across similar individuals. Both results reinforce calls for evaluation protocols that safeguard accuracy and fairness together under perturbations in online decision-making. The findings motivate future work on robust training and detection methods that preserve ethical and functional correctness when models face adversarial manipulation.

Abstract

Deep neural networks (DNNs) are vulnerable to adversarial perturbations that degrade both predictive accuracy and individual fairness, posing critical risks in high-stakes online decision-making. The relationship between these two dimensions of robustness remains poorly understood. To bridge this gap, we introduce robust individual fairness (RIF), which requires that similar individuals receive predictions consistent with the same ground truth even under adversarial manipulation. To evaluate and expose violations of RIF, we propose RIFair, an attack framework that applies identical perturbations to similar individuals to induce accuracy or fairness failures. We further introduce the perturbation impact index (PII) and perturbation impact direction (PID) to quantify and explain why identical perturbations produce unequal effects on individuals who should behave similarly. Experiments across diverse model architectures and real-world web datasets reveal that existing robustness metrics capture distinct and often incompatible failure modes in accuracy and fairness. We find that many online applicants are simultaneously vulnerable to multiple types of adversarial failures, and that inaccurate or unfair outcomes arise because similar individuals share the same PID but have sharply different PIIs, leading to divergent prediction-change trajectories in which some individuals cross decision boundaries earlier than others. Finally, we demonstrate that adversarial examples generated by RIFair can strategically manipulate test-set accuracy or fairness by replacing only a small subset of items, creating misleading impressions of model performance. These findings expose fundamental limitations in current robustness evaluations and highlight the need for jointly assessing accuracy and fairness under adversarial perturbations in high-stakes online decision-making.
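
The core observation, that one shared perturbation can flip the prediction for one individual but not for a similar one, can be illustrated with a toy example. The sketch below is not RIFair and does not compute the paper's PII or PID; it assumes a hypothetical linear scorer with made-up weights, and it uses the pre-perturbation score margin only as a rough stand-in for how far each individual sits from the decision boundary.

```python
import numpy as np

# Hypothetical linear scorer (assumption, not from the paper):
# score > 0 means "approve". The last feature is a protected attribute.
w = np.array([1.0, -0.5, 0.3])
b = -0.2

def predict(v):
    """Return 1 (approve) if the linear score is positive, else 0."""
    return int(v @ w + b > 0)

# Two similar individuals: identical non-protected features,
# different protected attribute. Assume the ground truth is 1 for both.
v = np.array([0.5, 0.4, 0.0])
v_sim = np.array([0.5, 0.4, 1.0])

# One shared perturbation, applied identically to both individuals.
delta = np.array([-0.25, 0.0, 0.0])

before = (predict(v), predict(v_sim))
after = (predict(v + delta), predict(v_sim + delta))

# Score margins before the perturbation: the individual with the
# smaller margin crosses the decision boundary first.
margins = (float(v @ w + b), float(v_sim @ w + b))

print("before:", before)
print("after: ", after)
print("margins:", margins)
```

In this toy setting both individuals are approved before the perturbation, but only the one with the smaller margin is rejected afterwards, so the perturbed pair is both inaccurate for one individual and treated inconsistently.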

Paper Structure (15 sections, 6 theorems, 16 equations, 2 figures, 8 tables, 1 algorithm)

This paper contains 15 sections, 6 theorems, 16 equations, 2 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

If a classifier $f: X \times A \rightarrow Y$ exhibits robust individual fairness for an instance $v$ under perturbations $\Delta \subseteq \mathbb{R}^{m+n}$, then: for any similar adversarial instance $v_{\text{adv}}' \in I(v_{\text{adv}}) \setminus \{v_{\text{adv}}\}$.

Figures (2)

  • Figure 1: Similar individual prediction changes under the same perturbation.
  • Figure 2: Illustration of prediction-change trajectories and corresponding perturbation impact index (PII) and perturbation impact direction (PID) for true-bias (TB), false-bias (FB), and false-fair (FF) perturbations.

Theorems & Definitions (8)

  • Definition 1: Robust Individual Fairness
  • Definition 2: Empirical Evaluation of RIF
  • Theorem 1
  • Theorem 2: Single-Perturbation Effect
  • Theorem 3: Cumulative Perturbation Impact
  • Theorem 4: Perturbation Needed for a Decision Flip
  • Theorem 5: Perturbation Tolerance Before a Flip
  • Theorem 6