Comparing Classifiers: A Case Study Using PyCM
Sadra Sabouri, Alireza Zolanvari, Sepand Haghighi
TL;DR
The paper argues that accuracy alone is insufficient for evaluating multi-class classifiers, especially with imbalanced data and in high-stakes domains. It advocates a multi-metric evaluation framework implemented in the PyCM library, which aggregates diverse class-level and overall metrics and curves into a unified analysis, including a Compare tool for ranking models. Through a Covertype case study and scenario-weighted evaluations (flammability and riparian priorities), it shows how different metrics and class-importance weights can lead to different model choices. The work contributes a practical, transparent guide and tooling for multi-dimensional model assessment, aiming to improve reliability and decision-making in real-world deployments.
Abstract
Selecting an optimal classification model requires a robust and comprehensive understanding of the model's performance. This paper provides a tutorial on the PyCM library, demonstrating its utility in conducting deep-dive evaluations of multi-class classifiers. By examining two case scenarios, we illustrate how the choice of evaluation metrics can fundamentally shift the interpretation of a model's efficacy. Our findings emphasize that a multi-dimensional evaluation framework is essential for uncovering small but important differences in model performance, because standard metrics used in isolation may miss these subtle trade-offs.
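
To make the central claim concrete, the following is a minimal sketch in plain Python of how accuracy and a class-weighted metric can rank two models differently. It does not use PyCM's API. The toy labels, the two hypothetical models, the helper functions, and the class weights are all assumptions introduced for illustration, not data or methods from the paper.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of samples whose predicted label matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def recall_per_class(actual, predicted):
    """Per-class recall: correct predictions for a class / samples of that class."""
    support = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: hits[label] / support[label] for label in sorted(support)}

def weighted_recall(recalls, class_weight):
    """Weighted mean of per-class recall, using assumed class-importance weights."""
    total = sum(class_weight.values())
    return sum(recalls[label] * w for label, w in class_weight.items()) / total

# Toy imbalanced data (assumed): 90 "common" samples and 10 "rare" samples.
actual = ["common"] * 90 + ["rare"] * 10

# Hypothetical model A always predicts the majority class.
model_a = ["common"] * 100

# Hypothetical model B gets 80 of 90 "common" and 9 of 10 "rare" samples right.
model_b = ["common"] * 80 + ["rare"] * 10 + ["rare"] * 9 + ["common"] * 1

# Assumed scenario in which the rare class matters four times as much.
class_weight = {"common": 1, "rare": 4}

for name, predicted in (("A", model_a), ("B", model_b)):
    recalls = recall_per_class(actual, predicted)
    print(
        name,
        "accuracy=%.3f" % accuracy(actual, predicted),
        "weighted_recall=%.3f" % weighted_recall(recalls, class_weight),
    )
```

In this toy setup, model A has the higher accuracy even though it never detects the rare class, while model B scores higher once the rare class is given more weight. This is the kind of disagreement between metrics that a multi-metric, class-weighted comparison is meant to surface.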
