Selecting a classification performance measure: matching the measure to the problem

David J. Hand, Peter Christen, Sumayya Ziyad

TL;DR

The paper argues that choosing a classifier performance measure must reflect the specific aims and constraints of the problem rather than defaulting to common metrics. It distinguishes structural properties of measures from problem-aim–driven properties, and defines a framework to assess measures through a confusion-matrix lens. A wide range of crisp binary measures is catalogued with their definitions and properties, and the authors advocate tailoring measure choice to the task, including handling unknown class distributions and avoiding misplaced reliance on threshold-averaged metrics. The work emphasizes practical guidance for researchers to articulate aims, constraints, and justification when evaluating classification methods, aiming to reduce misinterpretation and misapplication across domains.

Abstract

The problem of identifying to which of a given set of classes objects belong is ubiquitous, occurring in many research domains and application areas, including medical diagnosis, financial decision making, online commerce, and national security. But such assignments are rarely completely perfect, and classification errors occur. This means it is necessary to compare classification methods and algorithms to decide which is "best" for any particular problem. However, just as there are many different classification methods, so there are many different ways of measuring their performance. It is thus vital to choose a measure of performance which matches the aims of the research or application. This paper is a contribution to the growing literature on the relative merits of different performance measures. Its particular focus is the critical importance of matching the properties of the measure to the aims for which the classification is being made.

Paper Structure

This paper contains 8 sections, 8 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Two simulated sets of 100 data points, and the resulting optimal decision boundary for three performance measures (Matthews Correlation Coefficient, Error Rate, and F-measure, as discussed in Section \ref{sec:measures}).
  • Figure 2: Ranking of classifier performance for two data sets, evaluated using ten different performance measures (as defined in Section \ref{sec:measures}). The four classifiers are a decision tree (D), logistic regression (L), a random forest (R), and a support vector machine (S), as indicated by the node labels.
  • Figure 3: Notation for confusion matrix.
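
The three measures named in the Figure 1 caption can all be computed from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions and assumes the common TP/FP/FN/TN notation, which may differ from the notation in the paper's Figure 3. The function names and the two sets of counts are hypothetical, made up only to illustrate how different measures can rank the same pair of classifiers differently; they are not taken from the paper.

```python
from math import sqrt


def error_rate(tp, fp, fn, tn):
    """Proportion of all objects that are misclassified."""
    return (fp + fn) / (tp + fp + fn + tn)


def f_measure(tp, fp, fn, tn):
    """Harmonic mean of precision and recall (tn is unused by definition)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; returns 0.0 when a margin is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for two classifiers on the same 100 objects
# (20 positives, 80 negatives). Illustrative values only.
classifier_a = dict(tp=5, fp=0, fn=15, tn=80)    # cautious: few positive calls
classifier_b = dict(tp=15, fp=15, fn=5, tn=65)   # liberal: many positive calls

for name, counts in [("A", classifier_a), ("B", classifier_b)]:
    print(
        name,
        f"error rate={error_rate(**counts):.3f}",
        f"F-measure={f_measure(**counts):.3f}",
        f"MCC={mcc(**counts):.3f}",
    )
```

With these made-up counts, error rate favours classifier A (lower is better), while F-measure and MCC favour classifier B (higher is better), which is the kind of disagreement that makes the choice of measure depend on the aims of the problem.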