Labels in Extremes: How Well Calibrated are Extreme Multi-label Classifiers?

Nasib Ullah, Erik Schultheis, Jinbin Zhang, Rohit Babbar

TL;DR

This paper aims to establish the current status quo of calibration in XMLC by providing a systematic evaluation, comprising nine models from four different model families across seven benchmark datasets, and introduces the notion of ECE@k, which focusses on the top-$k$ probability mass, offering a more appropriate measure for evaluating probability calibration in XMLC scenarios.
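
One plausible reading of ECE@k can be sketched in a few lines of Python. The paper's exact definition is not reproduced in this summary, so everything here is an assumption for illustration: the function name `ece_at_k`, the equal-width binning, and the choice to pool the k highest-scored labels of every instance into one set of (probability, relevance) pairs before computing a standard binned ECE.

```python
from typing import Sequence, Set


def ece_at_k(
    probs: Sequence[Sequence[float]],
    relevant: Sequence[Set[int]],
    k: int,
    n_bins: int = 10,
) -> float:
    """Binned ECE restricted to each instance's k highest-scored labels.

    Assumed definition (not taken from the paper): pool the top-k
    (probability, relevance) pairs of all instances, then compute the
    usual equal-width binned Expected Calibration Error on that pool.
    """
    bin_conf = [0.0] * n_bins  # sum of predicted probabilities per bin
    bin_hits = [0.0] * n_bins  # number of relevant labels per bin
    bin_count = [0] * n_bins   # number of (instance, label) pairs per bin

    for row, rel in zip(probs, relevant):
        # Indices of the k highest-probability labels for this instance.
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        for j in top:
            p = row[j]
            # Equal-width bin index; p == 1.0 falls into the last bin.
            b = min(int(p * n_bins), n_bins - 1)
            bin_conf[b] += p
            bin_hits[b] += 1.0 if j in rel else 0.0
            bin_count[b] += 1

    total = sum(bin_count)
    if total == 0:
        return 0.0
    # Sum over bins of (count / total) * |accuracy - confidence|,
    # which simplifies to |hits - conf| / total per non-empty bin.
    return sum(
        abs(bin_hits[b] - bin_conf[b]) / total
        for b in range(n_bins)
        if bin_count[b] > 0
    )


# Toy data: two instances, three labels each, one relevant label per instance.
toy_probs = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.05]]
toy_relevant = [{0}, {1}]
toy_ece = ece_at_k(toy_probs, toy_relevant, k=2)
```

Restricting the pool to the top-k entries keeps the large mass of near-zero tail predictions from dominating the average, which is the motivation the TL;DR gives for the metric.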

Abstract

Extreme multi-label classification (XMLC) problems occur in settings such as related product recommendation, large-scale document tagging, or ad prediction, and are characterized by a label space that can span millions of possible labels. There are two implicit tasks that the classifier performs: evaluating each potential label for its expected worth, and then selecting the best candidates. For the latter task, only the relative order of scores matters, and this is what is captured by the standard evaluation procedure in the XMLC literature. However, in many practical applications, it is important to have a good estimate of the actual probability of a label being relevant, e.g., to decide whether to pay the fee to be allowed to display the corresponding ad. To judge whether an extreme classifier is indeed suited to this task, one can check, for example, whether it returns calibrated probabilities, which has hitherto not been done in this field. Therefore, this paper aims to establish the current status quo of calibration in XMLC by providing a systematic evaluation, comprising nine models from four different model families across seven benchmark datasets. As naive application of Expected Calibration Error (ECE) leads to meaningless results in long-tailed XMLC datasets, we instead introduce the notion of calibration@k (e.g., ECE@k), which focusses on the top-$k$ probability mass, offering a more appropriate measure for evaluating probability calibration in XMLC scenarios. While we find that different models can exhibit widely varying reliability plots, we also show that post-training calibration via a computationally efficient isotonic regression method enhances model calibration without sacrificing prediction accuracy. Thus, the practitioner can choose the model family based on accuracy considerations, and leave calibration to isotonic regression.
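
To make the isotonic regression step concrete, the following is a minimal pure-Python sketch of the classic Pool Adjacent Violators algorithm. It is not the authors' implementation: the helper names `fit_isotonic` and `calibrate` are hypothetical, and fitting one map on pooled (score, relevance) pairs from held-out data is an assumption made for illustration.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence, Tuple


def fit_isotonic(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Fit a non-decreasing step function with Pool Adjacent Violators.

    Returns (thresholds, values): block i covers scores up to
    thresholds[i] and maps them to values[i].
    """
    # Pool tied scores first so equal inputs always share one output.
    agg: Dict[float, List[float]] = {}
    for s, y in zip(scores, labels):
        entry = agg.setdefault(s, [0.0, 0.0])
        entry[0] += y
        entry[1] += 1.0

    # Each block holds [sum of labels, count, largest score in the block].
    blocks: List[List[float]] = []
    for s in sorted(agg):
        total, count = agg[s]
        blocks.append([total, count, s])
        # Merge backwards while the previous block's mean exceeds the
        # current block's mean (a monotonicity violation).
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper

    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def calibrate(
    score: float, thresholds: Sequence[float], values: Sequence[float]
) -> float:
    """Map a raw score to the fitted value of the block it falls into."""
    i = bisect_left(thresholds, score)
    return values[min(i, len(values) - 1)]  # clamp scores above the last block


# Toy held-out data: raw scores with binary relevance.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
toy_labels = [0, 1, 0, 1, 1, 1]
thresholds, values = fit_isotonic(toy_scores, toy_labels)
```

Because the fitted map is non-decreasing, it never reverses the order of two scores, although it can introduce ties. This is consistent with the abstract's claim that calibration need not cost prediction accuracy.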

Paper Structure

This paper contains 16 sections, 12 equations, 19 figures, 8 tables.

Figures (19)

  • Figure 1: Reliability plots at k=3 across different XMLC Models evaluated on the Amazon-670K Dataset. Different models show qualitatively different calibration behaviour.
  • Figure 2: Calibration Effects on Top-$k$ Prediction Probabilities and Reliability. Top-$k$ prediction histograms (top row: uncalibrated, third row: calibrated) and reliability plots (second row: uncalibrated, bottom row: calibrated) for Renee (representative of Transformer and deep XMLC models) and DiSMEC (representative of linear and PLT-based models) on Amazon-670K, and NGAME (representative of two-tower label feature-based XMLC models) on LF-WikiSeeAlso-320K.
  • Figure 3: First Row: Comparative analysis of calibration error metrics and P@5 performance between methods with label features (GalaXC, NGAME, Renee, Gandalf) and without label features (PLT, LightXML) on the LF-AmazonTitles-131K dataset. Second and Third Rows: Top-$k$ probability prediction patterns for NGAME and Gandalf.
  • Figure 4: Impact of Meta Classifier Strategies on Calibration: Comparison of (i) fixed vs. trainable meta classifier assignment and (ii) single vs. multi-resolution meta classifiers.
  • Figure 5: Impact of Calibration on Label Scaling: ECE@$k$ vs. Label Size for the Renee Model.
  • ...and 14 more figures