Standing on the shoulders of giants

Lucas Felipe Ferraro Cardoso, José de Sousa Ribeiro Filho, Vitor Cirilo Araujo Santos, Regiane Silva Kawasaki Frances, Ronnie Cley de Oliveira Alves

TL;DR

Standard confusion-matrix metrics do not capture instance-level difficulty in ML evaluation. This work addresses that gap by applying Item Response Theory (IRT), specifically the 3-Parameter Logistic model, to relate instance parameters to model ability. Treating models as respondents and test instances as items, the authors compute the instance-level probabilities $P(U_{ij}=1|\theta_j)$, the True Score $TrueS_j$, and the Total Score $TotalS_j$, and evaluate them on the Heart-Statlog dataset using 200 random classifiers and 10 base models. Friedman and Nemenyi tests indicate, with 97% confidence, that Total Score differs significantly from 66% of the classical metrics analyzed, and ICCCM analyses give a finer view of which instances actually validate model performance. The study concludes that IRT augments confusion-matrix analysis rather than replacing it: it supports context-aware model selection and exposes the limitations of aggregate metrics. Future work extends the approach to more datasets and to evaluation metrics that account for data complexity.
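As a rough illustration of the 3-Parameter Logistic model named above, the sketch below evaluates the textbook 3PL form $P(U_{ij}=1|\theta_j) = c_i + (1-c_i)/(1+e^{-a_i(\theta_j-b_i)})$, where $a_i$, $b_i$ and $c_i$ are the usual discrimination, difficulty and guessing parameters of item $i$. The function names and the example item values are hypothetical and are not taken from the paper. The second function computes the standard IRT expected score (the sum of the item probabilities); the paper's exact definitions of $TrueS_j$ and $TotalS_j$ are not given in this summary and may differ.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """Textbook 3PL probability that a respondent with ability `theta`
    answers an item correctly.

    a: discrimination, b: difficulty, c: guessing (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def expected_score(theta, items):
    """Standard IRT expected score: sum of the 3PL probabilities over items.

    `items` is an iterable of (a, b, c) tuples. This is the textbook
    quantity, assumed here for illustration only; it is not necessarily
    the paper's True Score or Total Score.
    """
    return sum(p_correct_3pl(theta, a, b, c) for a, b, c in items)


if __name__ == "__main__":
    # Hypothetical item parameters (a, b, c), chosen only for the example.
    example_items = [(1.0, 0.0, 0.2), (2.0, 1.0, 0.0), (0.5, -1.0, 0.1)]
    for theta in (-1.0, 0.0, 1.0):
        print(theta, round(expected_score(theta, example_items), 3))
```

In this framing each classifier plays the role of a respondent with ability $\theta_j$ and each test instance plays the role of an item, so a harder instance (larger $b_i$) contributes a lower probability of a correct prediction for the same model.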

Abstract

Although fundamental to the advancement of Machine Learning, the classic evaluation metrics extracted from the confusion matrix, such as precision and F1, are limited. They offer only a quantitative view of a model's performance and consider neither the complexity of the data nor the quality of each correct prediction. To overcome these limitations, recent research has introduced psychometric methods such as Item Response Theory (IRT), which allows assessment at the level of the latent characteristics of instances. This work investigates how IRT concepts can enrich a confusion matrix in order to identify the most appropriate model among options with similar performance. In this study, IRT does not replace classical metrics but complements them, offering a new layer of evaluation and a view of the fine-grained behavior of models on specific instances. It was also observed, with 97% confidence, that the IRT score makes a different contribution from 66% of the classical metrics analyzed.

Paper Structure

This paper contains 8 sections, 3 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Example of Item Characteristic Curve.
  • Figure 2: Flowchart of the methodology.
  • Figure 3: Heart-Statlog item parameter histograms. The orange bars are the instances of the minority class while the blue ones are the majority class.
  • Figure 4: Heart-Statlog instances arranged by item parameters.
  • Figure 5: Nemenyi test heatmap for evaluation metrics.
  • ...and 3 more figures