The Tile: A 2D Map of Ranking Scores for Two-Class Classification

Sébastien Piérard, Anaïs Halin, Anthony Cioppa, Adrien Deliège, Marc Van Droogenbroeck

TL;DR

The Tile addresses the challenge of ranking two-class classifiers across diverse, application-specific preferences by organizing an infinite family of ranking scores into a two-dimensional map. It builds canonical ranking scores via $\rankingScore[I_{a,b}] = \frac{(1-a)PTN + aPTP}{(1-a)PTN + (1-b)PFP + bPFN + aPTP}$ and shows that familiar metrics such as $A$, $TPR$, $TNR$, $PPV$, $NPV$, and $F_\beta$ are instances of this family, enabling a unified interpretation through iso-performance lines in ROC space. The Tile supports reading, comparing, and ranking classifiers, analyzes the impact of priors, investigates no-skill performances with curves $\gamma_\pi$ and $\gamma_\tau$, and links to existing evaluation spaces while revealing the geometry of score-induced orderings. The framework offers a practical and rigorous tool for application-aware model selection, robustness assessment, and a deeper understanding of ranking properties beyond traditional single-score or two-score plots. It could inform benchmarking, model comparison, and the design of evaluation metrics by emphasizing continuous, prior-aware rankings on a single visual surface.
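The special cases named above can be checked numerically by substituting particular importances $(a,b)$ into the formula. The following is a minimal sketch, not the authors' code: the function name `ranking_score` and the example probabilities are hypothetical, and the $(a,b)$ values are the ones obtained by plugging into the expression above (for $F_\beta$, $a=1$ and $b=\beta^2/(1+\beta^2)$).

```python
def ranking_score(ptn, pfp, pfn, ptp, a, b):
    """Canonical ranking score for importances (a, b) in [0, 1]^2.

    ptn, pfp, pfn, ptp are the probabilities of a true negative,
    false positive, false negative, and true positive.
    """
    num = (1 - a) * ptn + a * ptp
    den = num + (1 - b) * pfp + b * pfn
    if den == 0:
        raise ValueError("score undefined for this performance and (a, b)")
    return num / den


# Hypothetical performance (illustrative values only; they sum to 1).
ptn, pfp, pfn, ptp = 0.5, 0.2, 0.1, 0.2

accuracy = ranking_score(ptn, pfp, pfn, ptp, a=0.5, b=0.5)  # (TN + TP) / total
tpr = ranking_score(ptn, pfp, pfn, ptp, a=1, b=1)           # TP / (TP + FN)
tnr = ranking_score(ptn, pfp, pfn, ptp, a=0, b=0)           # TN / (TN + FP)
ppv = ranking_score(ptn, pfp, pfn, ptp, a=1, b=0)           # TP / (TP + FP)
npv = ranking_score(ptn, pfp, pfn, ptp, a=0, b=1)           # TN / (TN + FN)

beta = 2.0
f_beta = ranking_score(ptn, pfp, pfn, ptp, a=1, b=beta**2 / (1 + beta**2))

# Reference value from the usual F-beta definition, for comparison.
f_beta_ref = (1 + beta**2) * ptp / ((1 + beta**2) * ptp + beta**2 * pfn + pfp)
```

Each named metric is recovered from one point of the unit square, which is what allows the whole family to be laid out as a two-dimensional map.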

Abstract

In the computer vision and machine learning communities, as well as in many other research domains, rigorous evaluation of any new method, including classifiers, is essential. One key component of the evaluation process is the ability to compare and rank methods. However, ranking classifiers and accurately comparing their performances, especially when taking application-specific preferences into account, remains challenging. For instance, commonly used evaluation tools like Receiver Operating Characteristic (ROC) and Precision/Recall (PR) spaces display performances based on two scores. Hence, they are inherently limited in their ability to compare classifiers across a broader range of scores and lack the capability to establish a clear ranking among classifiers. In this paper, we present a novel versatile tool, named the Tile, that organizes an infinity of ranking scores in a single 2D map for two-class classifiers, including common evaluation scores such as the accuracy, the true positive rate, the positive predictive value, Jaccard's coefficient, and all F-beta scores. Furthermore, we study the properties of the underlying ranking scores, such as the influence of the priors or the correspondences with the ROC space, and show how to characterize any other score by comparing it to the Tile. Overall, we demonstrate that the Tile is a powerful tool that effectively captures all the rankings in a single visualization and allows interpreting them.

Paper Structure

This paper contains 53 sections, 10 theorems, 53 equations, 9 figures, 2 tables.

Key Result

Lemma 1

Let $\mathrm{change}_{\hat{Y}}:\mathbb{P}_{(\Omega,\Sigma)}\rightarrow\mathbb{P}_{(\Omega,\Sigma)}$ be the operation that changes the predicted class $\hat{Y}$. We have $\rankingScore[I_{a,b}]\circ\mathrm{change}_{\hat{Y}}=1-\rankingScore[I_{b,a}]$.
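Lemma 1 can be sanity-checked numerically. The sketch below assumes that changing the predicted class flips every prediction, so that true negatives become false positives and false negatives become true positives (and vice versa); this reading, and the helper names `ranking_score` and `change_prediction`, are assumptions made for illustration and are not taken from the paper's code.

```python
import random


def ranking_score(ptn, pfp, pfn, ptp, a, b):
    """Canonical ranking score for importances (a, b)."""
    num = (1 - a) * ptn + a * ptp
    den = num + (1 - b) * pfp + b * pfn
    return num / den


def change_prediction(ptn, pfp, pfn, ptp):
    """Flip every prediction (assumed meaning of change_{Y-hat}).

    Negatives predicted negative become false positives and vice versa;
    positives predicted positive become false negatives and vice versa.
    Returns the new (ptn, pfp, pfn, ptp).
    """
    return pfp, ptn, ptp, pfn


rng = random.Random(0)
max_gap = 0.0
for _ in range(1000):
    # Random performance with strictly positive probabilities summing to 1.
    w = [rng.uniform(0.01, 1.0) for _ in range(4)]
    total = sum(w)
    perf = tuple(x / total for x in w)
    a, b = rng.random(), rng.random()

    lhs = ranking_score(*change_prediction(*perf), a, b)
    rhs = 1 - ranking_score(*perf, b, a)  # note the swapped importances
    max_gap = max(max_gap, abs(lhs - rhs))
```

Under this assumption, `max_gap` stays at floating-point rounding level, which is consistent with the identity: flipping the predictions mirrors the score and swaps the roles of $a$ and $b$.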

Figures (9)

  • Figure 1: Introducing the Tile. We introduce a new visual tool, called the Tile, representing an infinite family of ranking scores to evaluate the performances of two-class classifiers at a glance. In this figure, we highlight the correspondences between specific ranking scores on the Tile and their corresponding set of iso-performance lines in the ROC space. Notably, the variation of iso-performance lines along the right border of the Tile demonstrates the limitations of the ROC space for ranking performance. This visualization illustrates how the Tile simplifies the task of ranking classifiers and enhances the interpretation of performance scores across various evaluation spaces, such as the ROC space.
  • Figure 2: The geometry of the ranking scores $\rankingScore[I_{a,b}]$ in the ROC plane $(FPR,TPR)$. Example given for the class priors $\pi_+=1-\pi_-=0.2$ and the importance given by $(a,b)=(0.95,0.7)$.
  • Figure 3: Placement of the canonical ranking scores (left) and of some performance orderings (right) on the Tile. The symbol $\dagger$ indicates the orderings that are specific to given priors. For the orderings whose locations are prior-dependent, we arbitrarily chose a negative prior of $0.7$. Double arrows $\leftrightarrow$ indicate the direction in which $\lesssim_{WA}$ moves when the weights are tuned and how the curve on which $\lesssim_{BA}$ and $\lesssim_{\kappa}$ lie moves when the priors are tuned. The colored points correspond to probabilistic scores.
  • Figure 4: Tiles showing the rank correlations (Kendall $\tau$) between $9$ probabilistic scores (those that belong to the ranking scores, as given in Example 1), and all ranking scores, for a uniform distribution of performances. The correlation values have been estimated based on $10{,}000$ performances drawn at random.
  • Figure 5: Toy examples showing on the Tile which of the $4$ performances $P_{-}$, $P_{1}$, $P_{2}$, and $P_{+}$ is the best. The $3$ examples differ in the class priors (either $0.2$, $0.5$, or $0.8$ for the positive class). In all examples, $P_{-}$ ($\CIRCLE$) is the performance of classifiers that always predict the negative class, $P_{1}$ ($\CIRCLE$) is such that $TNR(P_{1})=0.7$ and $TPR(P_{1})=0.7$, $P_{2}$ ($\CIRCLE$) is such that $TNR(P_{2})=0.5$ and $TPR(P_{2})=0.8$, and $P_{+}$ ($\CIRCLE$) is the performance of classifiers that always predict the positive class.
  • ...and 4 more figures
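
The toy setting of Figure 5 can be reproduced at a single point of the Tile. The sketch below is illustrative only: it builds each performance from its $(TNR, TPR)$ pair and the positive prior, then evaluates the canonical ranking score at $(a,b)=(0.5,0.5)$, which coincides with the accuracy. The helper names `performance`, `ranking_score`, and `best` are hypothetical, and the figure itself covers the whole Tile rather than this one point.

```python
def performance(tnr, tpr, prior_pos):
    """Return (ptn, pfp, pfn, ptp) from ROC-style rates and the positive prior."""
    prior_neg = 1 - prior_pos
    return (
        prior_neg * tnr,        # true negatives
        prior_neg * (1 - tnr),  # false positives
        prior_pos * (1 - tpr),  # false negatives
        prior_pos * tpr,        # true positives
    )


def ranking_score(perf, a, b):
    """Canonical ranking score for importances (a, b)."""
    ptn, pfp, pfn, ptp = perf
    num = (1 - a) * ptn + a * ptp
    den = num + (1 - b) * pfp + b * pfn
    if den == 0:
        raise ValueError("score undefined for this performance and (a, b)")
    return num / den


# The four performances of the toy example, as (TNR, TPR) pairs.
candidates = {
    "P-": (1.0, 0.0),  # always predicts the negative class
    "P1": (0.7, 0.7),
    "P2": (0.5, 0.8),
    "P+": (0.0, 1.0),  # always predicts the positive class
}


def best(prior_pos, a, b):
    """Name of the candidate with the highest ranking score."""
    scores = {
        name: ranking_score(performance(tnr, tpr, prior_pos), a, b)
        for name, (tnr, tpr) in candidates.items()
    }
    return max(scores, key=scores.get)


best_by_prior = {p: best(p, a=0.5, b=0.5) for p in (0.2, 0.5, 0.8)}
```

At this accuracy point the winner changes with the prior: $P_{-}$ for a positive prior of $0.2$, $P_{1}$ for $0.5$, and $P_{+}$ for $0.8$, which illustrates why the priors matter when reading a ranking off the Tile.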

Theorems & Definitions (26)

  • Example 1: Probabilistic ranking scores
  • Example 2: F-scores
  • Example 3: PABDC
  • Definition 1
  • Definition 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • ...and 16 more