Fragility-aware Classification for Understanding Risk and Improving Generalization

Chen Yang, Zheng Cui, Daniel Zhuoyu Long, Jin Qi, Ruohan Zhan

TL;DR

This work introduces the Fragility Index (FI), a risk-averse metric that captures the tail risk of confident misjudgments in multi-class classification. Defined within the robust satisficing (RS) framework, FI accounts for data uncertainty and distributional shift to improve generalization. The authors derive exact convex reformulations of FI under KL divergence and 1-Wasserstein distance for cross-entropy, hinge-type, and Lipschitz losses, and extend the approach to deep learning with an FI-regularized objective. Experiments on synthetic data and medical diagnosis tasks show that FI identifies misjudgment risk and that FI-based training improves robustness and generalization; an FI-based ResNet on MedMNIST improves cross-entropy and FI while maintaining competitive accuracy and AUC. The paper also clarifies connections to distributionally robust optimization (DRO), provides finite-sample guarantees, and outlines practical implications for deploying risk-aware classifiers in safety-critical settings.

Abstract

Classification models play a critical role in data-driven decision-making applications such as medical diagnosis, user profiling, recommendation systems, and default detection. Traditional performance metrics, such as accuracy, focus on overall error rates but fail to account for the confidence of incorrect predictions, thereby overlooking the risk of confident misjudgments. This risk is particularly significant in cost-sensitive and safety-critical domains like medical diagnosis and autonomous driving, where overconfident false predictions may cause severe consequences. To address this issue, we introduce the Fragility Index (FI), a novel metric that evaluates classification performance from a risk-averse perspective by explicitly capturing the tail risk of confident misjudgments. To enhance generalizability, we define FI within the robust satisficing (RS) framework, incorporating data uncertainty. We further develop a model training approach that optimizes FI while maintaining tractability for common loss functions. Specifically, we derive exact reformulations for cross-entropy loss, hinge-type loss, and Lipschitz loss, and extend the approach to deep learning models. Through synthetic experiments and real-world medical diagnosis tasks, we demonstrate that FI effectively identifies misjudgment risk and FI-based training improves model robustness and generalizability. Finally, we extend our framework to deep neural network training, further validating its effectiveness in enhancing deep learning models.
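
The gap between accuracy and the risk of confident misjudgments can be illustrated with a small sketch. The two classifiers below are hypothetical, and the tail average used here is a generic CVaR-style stand-in chosen for illustration; it is not the paper's Fragility Index, whose definition is not reproduced in this summary.

```python
import math

def accuracy(p_true):
    """Binary case: a sample is correct when the true class gets probability > 0.5."""
    return sum(p > 0.5 for p in p_true) / len(p_true)

def cross_entropy_losses(p_true):
    """Per-sample cross-entropy, given the probability assigned to the true class."""
    return [-math.log(p) for p in p_true]

def tail_mean(losses, alpha=0.2):
    """Mean of the worst ceil(alpha * n) losses (a CVaR-style tail average).

    Illustrative stand-in for a tail-risk measure, NOT the paper's FI.
    """
    k = max(1, math.ceil(alpha * len(losses)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k

# Hypothetical data: probability each classifier assigns to the TRUE class.
# Both misclassify the same two samples, but with very different confidence.
hesitant = [0.9] * 8 + [0.45, 0.45]       # wrong, but nearly undecided
overconfident = [0.9] * 8 + [0.02, 0.02]  # wrong and almost certain

if __name__ == "__main__":
    for name, p_true in [("hesitant", hesitant), ("overconfident", overconfident)]:
        losses = cross_entropy_losses(p_true)
        print(name, accuracy(p_true), tail_mean(losses))
```

Both classifiers reach the same accuracy (0.8), yet the tail average of the cross-entropy loss is roughly 0.80 for the hesitant classifier and roughly 3.91 for the overconfident one. An accuracy-only comparison would miss this difference, which is the kind of risk a tail-sensitive metric is meant to expose.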

Paper Structure

This paper contains 77 sections, 28 theorems, 169 equations, 6 figures, 5 tables, and 1 algorithm.

Key Result

Theorem 1

The Fragility Index $\mathrm{FI}(h; \tau)$ has the following properties.

Figures (6)

  • Figure 1: Distributions of the ranking errors and of the classifiers' estimated probabilities for false predictions, for the two classifiers in the example.
  • Figure 2: The relationship between the training sample size and the accuracy, AUC, and FI when $p_{flip} = 0$. The colors and line styles represent different models and parameters. The values $1.1$ and $1.2$ represent the target ratio $\lambda$ used in each model. The error bands show 95% confidence intervals.
  • Figure 3: The relationship between the label-flipping rate $p_{flip}$ and the accuracy, AUC, and FI when the sample size is 50. The colors and line styles represent different models. The values $1.1$ and $1.2$ represent the target ratio $\lambda$ used in each model. The error bands show 95% confidence intervals.
  • Figure 4: The average accuracy, AUC, and FI on the heart failure prediction dataset. The error bands show 95% confidence intervals.
  • Figure 5: The ranking errors and the classifiers' estimated probabilities for the three models on the heart failure prediction dataset when $p_{flip} = 0$.
  • ...and 1 more figure

Theorems & Definitions (30)

  • Definition 1: Fragility Index
  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Proposition 1
  • Corollary 1
  • Lemma 3
  • Theorem 2
  • Proposition 2
  • Lemma 4
  • ...and 20 more