An In-Depth Examination of Risk Assessment in Multi-Class Classification Algorithms
Disha Ghandwani, Neeraj Sarna, Yuanyuan Li, Yang Lin
TL;DR
This work tackles risk assessment for multi-class classifiers, i.e. estimating the probability that a classifier misclassifies a sample. It explores two main approaches: calibration of output probabilities and a method based on conformal prediction (CP). The authors compare six risk-estimation techniques across diverse datasets and model families, including a novel inverse conformal prediction (InvCP) framework that reframes risk as the inverse of CP coverage, using a calibrated miscoverage level to bound the misclassification probability. Empirical results show that no single method dominates: calibration methods excel on tasks with many labels, while CP-based approaches offer robust, hyperparameter-free estimates that are often competitive, especially on datasets with fewer labels. The findings underscore the importance of task characteristics in selecting a risk-estimation strategy and highlight InvCP as a practical, model-agnostic tool with conservative guarantees for safety-critical applications.
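To make the idea of inverting CP coverage concrete, the following is a minimal sketch and not the authors' exact InvCP procedure. It assumes split conformal prediction with the nonconformity score 1 - p(label), and it takes the risk estimate to be the smallest miscoverage level at which every label other than the predicted one drops out of the prediction set. That level equals the largest conformal p-value among the non-predicted labels. The function names `conformal_p_value` and `inverse_cp_risk` are hypothetical.

```python
def conformal_p_value(cal_scores, score):
    """Smoothed fraction of calibration scores that are >= `score`."""
    n_ge = sum(1 for s in cal_scores if s >= score)
    return (1 + n_ge) / (len(cal_scores) + 1)


def inverse_cp_risk(cal_probs, cal_labels, test_probs):
    """Return (predicted label, risk estimate) for one test sample.

    Assumptions (not from the source): the nonconformity score is
    1 - p(label), and the risk estimate is the smallest miscoverage
    level alpha at which the prediction set holds no label other than
    the predicted one.
    """
    # Scores of the true labels on a held-out calibration set.
    cal_scores = [1.0 - p[y] for p, y in zip(cal_probs, cal_labels)]

    # Predicted label: the class with the highest output probability.
    n_classes = len(test_probs)
    pred = max(range(n_classes), key=lambda k: test_probs[k])

    # A label stays in the prediction set while its p-value exceeds alpha,
    # so all competitors are excluded once alpha reaches their largest p-value.
    alpha = max(
        conformal_p_value(cal_scores, 1.0 - test_probs[k])
        for k in range(n_classes)
        if k != pred
    )
    return pred, alpha


# Toy calibration data: class probabilities and true labels (made up).
cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
cal_labels = [0, 0, 1, 1]

confident = inverse_cp_risk(cal_probs, cal_labels, [0.95, 0.05])
ambiguous = inverse_cp_risk(cal_probs, cal_labels, [0.55, 0.45])
print(confident, ambiguous)
```

In this sketch the ambiguous sample receives a larger risk estimate than the confident one. The estimate cannot fall below 1/(n+1) for n calibration samples, and the underlying CP coverage guarantee is marginal over samples rather than conditional on a given input.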
Abstract
Advanced classification algorithms are increasingly used in safety-critical applications such as healthcare and engineering. In such applications, misclassifications made by ML algorithms can result in substantial financial or health-related losses. To better anticipate and prepare for such losses, the user of an algorithm seeks an estimate of the probability that the algorithm misclassifies a sample. We refer to this task as risk assessment. For a variety of models and datasets, we numerically analyze the performance of different methods in solving the risk-assessment problem. We consider two solution strategies: a) calibration techniques that calibrate the output probabilities of classification models to provide accurate probability outputs; and b) a novel approach based upon the prediction interval generation technique of conformal prediction. Our conformal-prediction-based approach is model and data-distribution agnostic, simple to implement, and provides reasonable results for a variety of use cases. We compare the different methods on a broad variety of models and datasets.
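As an illustration of strategy (a), the sketch below uses temperature scaling, a common calibration technique chosen here only as an example; the abstract does not say which calibration methods are compared. It assumes the risk estimate is one minus the largest calibrated probability, and it fits the temperature by a simple grid search over held-out data. All function names and the toy logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits_list, labels, temperature):
    """Mean negative log-likelihood of the true labels."""
    losses = [
        -math.log(softmax(z, temperature)[y])
        for z, y in zip(logits_list, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logits_list, labels, grid=None):
    """Pick the temperature on a grid that minimizes held-out NLL."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(grid, key=lambda t: nll(logits_list, labels, t))


def calibrated_risk(logits, temperature):
    """Risk estimate: one minus the top calibrated probability."""
    return 1.0 - max(softmax(logits, temperature))


# Toy held-out set: the model is very confident but wrong on one of four samples.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 1, 1, 0]

T = fit_temperature(val_logits, val_labels)
print(T, calibrated_risk([4.0, 0.0], 1.0), calibrated_risk([4.0, 0.0], T))
```

Because the toy model is overconfident, the fitted temperature is greater than 1 and the calibrated risk estimate is larger than the uncalibrated one. Unlike a conformal approach, this estimate depends on how well the calibration map fits the held-out data.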
