On Evaluation of Vision Datasets and Models using Human Competency Frameworks

Rahul Ramachandran; Tejal Kulkarni; Charchit Sharma; Deepak Vijaykeerthy; Vineeth N Balasubramanian

TL;DR

This paper explores Item Response Theory (IRT), a framework that infers interpretable latent parameters for an ensemble of models and each dataset item, enabling richer evaluation and analysis beyond the single accuracy number.

Abstract

Evaluating models and datasets in computer vision remains a challenging task, with most leaderboards relying solely on accuracy. While accuracy is a popular metric for model evaluation, it provides only a coarse assessment by considering a single model's score on all dataset items. This paper explores Item Response Theory (IRT), a framework that infers interpretable latent parameters for an ensemble of models and each dataset item, enabling richer evaluation and analysis beyond the single accuracy number. Leveraging IRT, we assess model calibration, select informative data subsets, and demonstrate the usefulness of its latent parameters for analyzing and comparing models and datasets in computer vision.
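As a rough illustration of the kind of latent parameters IRT assigns, the sketch below evaluates the standard three-parameter logistic (3PL) item characteristic curve, the form named in the caption of Figure 1. The excerpt here does not show the paper's exact parameterization, so the function name `icc_3pl` and the symbols `theta` (model ability), `a` (discriminability), `b` (difficulty), and `c` (guessing) are assumptions drawn from the common textbook formulation rather than from the authors' code.

```python
import math


def icc_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Textbook 3PL item characteristic curve (assumed form, not the paper's code).

    theta: latent ability of the model (respondent)
    a:     item discriminability (slope of the curve around b)
    b:     item difficulty (ability at which the curve is halfway between c and 1)
    c:     guessing parameter (lower asymptote of the curve)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # Hypothetical item with difficulty b = 5, as in the Figure 1 caption;
    # the values of a and c are made up for illustration.
    for theta in (0.0, 2.5, 5.0, 7.5, 10.0):
        print(theta, round(icc_3pl(theta, a=1.0, b=5.0, c=0.2), 4))
```

Under this form, a model whose ability equals the item difficulty has a success probability exactly halfway between the guessing floor `c` and 1, and the probability rises monotonically with ability whenever `a` is positive.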

Paper Structure (12 sections, 12 equations, 10 figures, 3 tables)

Introduction
Item Response Theory
Experiments
Assessing Model Calibration
Dataset Complexity
Data Selection
Limitations and Research Directions
More about IRT
Variational Inference
Multidimensional IRT Models
Continuous IRT Models
Experiments with Ensembles

Figures (10)

Figure 1: 3PL ICC for image with $b=5$
Figure 2: Overall Workflow
Figure 3: Percentage of images with annotation errors and class overlap for given values of overconfidence across different models: (Top) ResNet-18 (20 epochs) (Middle) ResNet-50 (100 epochs) (Bottom) ViT
Figure 4: Class-wise median guessing vs. class-wise median difficulty and discriminability of the Gaussian noise corruption: (Top) Severity 1 (Bottom) Severity 5
Figure 5: Correlation of rankings on small subset with overall rankings
...and 5 more figures
