Interactive Classification Metrics: A graphical application to build robust intuition for classification model evaluation
David H. Brown, Davide Chicco
TL;DR
The paper introduces Interactive Classification Metrics (ICM), a local, browser-based visualization tool that helps practitioners understand and compare binary classification evaluation metrics by interactively varying class distributions and decision thresholds. ICM plots ROC and PR curves with their AUCs, MCC-related curves, and basic rates (e.g., accuracy, recall, specificity, NPV, PPV, F1), highlighting tradeoffs and context-dependent interpretations that are often overlooked. The authors demonstrate a common pitfall, relying on accuracy with imbalanced data, where MCC and PR-based metrics can reveal underlying performance that accuracy misses; shading and contextual baselines aid interpretation. The tool fills a gap in educational resources by offering an interactive, comprehensive visualization platform that requires no data cleaning or model retraining. It is freely available under the MIT license for broad adoption in education and practice, potentially improving metric selection and interpretation across binary and multi-class scenarios. Together, the MCC-based insights, ROC/PR tradeoffs, and distribution-manipulation capabilities support practical model evaluation and intuition in real-world settings.
Abstract
Machine learning continues to grow in popularity in academia and industry, and is increasingly used in other fields. However, most of the common metrics used to evaluate even simple binary classification models have shortcomings that are neither immediately obvious nor consistently taught to practitioners. Here we present Interactive Classification Metrics (ICM), an application to visualize and explore the relationships between different evaluation metrics. The user changes the distribution statistics and explores the corresponding changes across a suite of evaluation metrics. The interactive, graphical nature of this tool emphasizes the tradeoffs of each metric without the overhead of data wrangling and model training. The goals of this application are (1) to help practitioners in the ever-expanding machine learning field choose the most appropriate evaluation metrics for their classification problem, and (2) to promote the careful attention to interpretation that is required even in the simplest scenarios, such as binary classification. Our application is publicly available for free under the MIT license as a Python package on PyPI at https://pypi.org/project/interactive-classification-metrics and on GitHub at https://github.com/davhbrown/interactive_classification_metrics.
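
The idea of changing distribution statistics and watching the metrics respond can be sketched in a few lines of Python. This is an illustration only, not code from the ICM package: the function name `expected_metrics`, the choice of Gaussian score distributions, and all numeric settings are assumptions made for this example. It computes the expected confusion matrix for two classes at a given decision threshold and derives accuracy, recall, specificity, precision, and MCC from it.

```python
from math import sqrt
from statistics import NormalDist


def expected_metrics(n_pos, n_neg, pos_dist, neg_dist, threshold):
    """Expected metrics when scores >= threshold are predicted positive.

    Hypothetical helper for illustration; class scores are assumed Gaussian.
    """
    # Expected confusion-matrix counts from each class's score distribution.
    tp = n_pos * (1 - pos_dist.cdf(threshold))
    fn = n_pos - tp
    fp = n_neg * (1 - neg_dist.cdf(threshold))
    tn = n_neg - fp

    accuracy = (tp + tn) / (n_pos + n_neg)
    recall = tp / n_pos
    specificity = tn / n_neg
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0

    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "mcc": mcc,
    }


# Assumed scenario: 5% positives, heavily overlapping score distributions,
# and a conservative threshold that labels almost everything negative.
imbalanced = expected_metrics(
    n_pos=50,
    n_neg=950,
    pos_dist=NormalDist(mu=1.0, sigma=1.0),
    neg_dist=NormalDist(mu=0.0, sigma=1.0),
    threshold=2.0,
)

for name, value in imbalanced.items():
    print(f"{name:12s} {value:.3f}")
```

In this assumed scenario accuracy comes out high because the majority class dominates the count of correct predictions, while recall and MCC stay low. Changing `n_pos`, `n_neg`, the distribution parameters, or `threshold` and re-running mimics, in a very reduced form, the kind of exploration the application supports graphically.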
