Comparing Classifiers: A Case Study Using PyCM

Sadra Sabouri, Alireza Zolanvari, Sepand Haghighi

TL;DR

The paper argues that accuracy alone is insufficient for evaluating multi-class classifiers, especially with imbalanced data and in high-stakes domains. It advocates a multi-metric evaluation framework implemented in the PyCM library, which aggregates class-level and overall metrics and curves into a unified analysis and includes a Compare tool for ranking models. Through a case study on the Covertype dataset and two scenario-weighted evaluations (flammability and riparian priorities), it shows that different metrics and class-importance weights can lead to different model choices. The work contributes a practical, transparent guide and tooling for multi-dimensional model assessment, aiming to improve reliability and decision-making in real-world deployments.
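To make the point about accuracy versus class-weighted evaluation concrete, here is a minimal sketch in plain Python. It does not use PyCM's API. The two confusion matrices, the class names, and the weights are hypothetical values chosen only for illustration, and the weighted-recall score is one simple way to encode class importance, not necessarily the scoring the paper's Compare tool applies.

```python
# Hypothetical confusion matrices, stored as matrix[actual][predicted] = count.
# The data is imbalanced: 90 "majority" samples and 10 "minority" samples.
MODEL_A = {
    "majority": {"majority": 88, "minority": 2},
    "minority": {"majority": 7, "minority": 3},
}
MODEL_B = {
    "majority": {"majority": 80, "minority": 10},
    "minority": {"majority": 1, "minority": 9},
}
MODELS = {"A": MODEL_A, "B": MODEL_B}

# Assumed class-importance weights: the minority class matters four times more.
WEIGHTS = {"majority": 1.0, "minority": 4.0}


def accuracy(matrix):
    """Fraction of all samples that were classified correctly."""
    correct = sum(matrix[cls][cls] for cls in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


def recall_per_class(matrix):
    """Per-class recall: correct predictions divided by actual class size."""
    return {cls: matrix[cls][cls] / sum(matrix[cls].values()) for cls in matrix}


def weighted_recall(matrix, weights):
    """Weighted average of per-class recall, normalized by the total weight."""
    recalls = recall_per_class(matrix)
    total_weight = sum(weights[cls] for cls in matrix)
    return sum(weights[cls] * recalls[cls] for cls in matrix) / total_weight


def pick_best(models, score):
    """Return the name of the model with the highest score."""
    return max(models, key=lambda name: score(models[name]))


if __name__ == "__main__":
    for name, matrix in MODELS.items():
        print(name, "accuracy:", round(accuracy(matrix), 4),
              "weighted recall:", round(weighted_recall(matrix, WEIGHTS), 4))
    print("best by accuracy:", pick_best(MODELS, accuracy))
    print("best by weighted recall:",
          pick_best(MODELS, lambda m: weighted_recall(m, WEIGHTS)))
```

With these made-up numbers, accuracy favors model A while the class-weighted score favors model B, which is the kind of reversal the scenario-weighted evaluations are meant to surface.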

Abstract

Selecting an optimal classification model requires a robust and comprehensive understanding of its performance. This paper provides a tutorial on the PyCM library, demonstrating its utility in conducting deep-dive evaluations of multi-class classifiers. By examining two different case scenarios, we illustrate how the choice of evaluation metrics can fundamentally shift the interpretation of a model's efficacy. Our findings emphasize that a multi-dimensional evaluation framework is essential for uncovering small but important differences in model performance, subtle trade-offs that standard metrics alone may miss.

Paper Structure (16 sections, 12 equations, 3 figures, 4 tables)


Figures (3)

  • Figure 1: PyCM in the Machine Learning Workflow. The workflow starts with Initiation (Problem Definition, Data Collection, and Preprocessing), followed by Training (Model Selection and Training). Evaluation, where PyCM is used, includes Metric Selection and Model Comparison. Results from the evaluation guide Model Tuning. The final phase is Production, which includes Monitoring and Maintenance to ensure long-term reliability.
  • Figure 2: ROC curves (a) and PR curves (b) for a hypothetical multi-class classifier distinguishing between Healthy, Flu, and COVID cases, used as our running example. Area under the curve (AUC) values are displayed for each class and highlight how the model’s performance varies across the different categories.
  • Figure 3: Comparison of confusion matrices for (a) Classifier 1 and (b) Classifier 2. Each matrix shows the number of predictions per class.
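
The per-class AUC values mentioned for Figure 2 can be illustrated with a small one-vs-rest sketch. This is plain Python rather than PyCM code, and the labels and predicted probabilities are invented for the example. It assumes each class is scored against all others and uses the rank-based definition of ROC AUC: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
CLASSES = ["Healthy", "Flu", "COVID"]

# Hypothetical ground-truth labels and predicted class probabilities.
# Each probability row follows the order given in CLASSES.
LABELS = ["Healthy", "Healthy", "Flu", "Flu", "COVID", "COVID"]
PROBS = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
    [0.1, 0.4, 0.5],
    [0.1, 0.2, 0.7],
]


def binary_auc(pos_scores, neg_scores):
    """Rank-based ROC AUC: share of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for pos in pos_scores:
        for neg in neg_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


def one_vs_rest_auc(labels, probs, classes):
    """Compute one ROC AUC per class by treating that class as the positive one."""
    result = {}
    for idx, cls in enumerate(classes):
        pos = [row[idx] for row, label in zip(probs, labels) if label == cls]
        neg = [row[idx] for row, label in zip(probs, labels) if label != cls]
        result[cls] = binary_auc(pos, neg)
    return result


if __name__ == "__main__":
    for cls, auc in one_vs_rest_auc(LABELS, PROBS, CLASSES).items():
        print(f"{cls}: AUC = {auc:.4f}")
```

In this toy data the Flu class gets a lower AUC than the other two because one Flu sample is scored no higher than some non-Flu samples, which mirrors how per-class curves can expose weaknesses that a single overall number hides.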