Investigating Data Usage for Inductive Conformal Predictors

Yizirui Fang, Anthony Bellotti

TL;DR

The paper addresses how to allocate scarce development data for inductive conformal predictors (ICPs) and whether overlap between the training and calibration sets affects validity. It adopts an ANN-based probabilistic predictor and evaluates ICP performance on the Covtype dataset under three partitioning scenarios, using metrics such as coverage, bias, and width across 200 runs and multiple confidence levels. The findings reveal that small calibration sets can improve efficiency but increase variance, and that overlap between training and calibration sets can produce invalid ICPs and negative bias in data-scarce settings, though higher confidence levels show more robustness with larger datasets. These results offer practical guidance for ICP deployment in data-limited and safety-critical contexts and point to future work extending the analysis to imbalanced or multi-class problems and to regression tasks.

Abstract

Inductive conformal predictors (ICPs) are algorithms that generate prediction sets, instead of point predictions, which are valid at a user-defined confidence level, assuming only exchangeability. These algorithms are useful for reliable machine learning and are increasing in popularity. The ICP development process involves dividing development data into three parts: training, calibration and test. When development data are limited or expensive, the most efficient way to divide them remains an open question. This study provides several experiments to explore this question and considers the case for allowing overlap of examples between training and calibration sets. Conclusions are drawn that will be of value to academics and practitioners planning to use ICPs.
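
As an illustration of the three-way split described in the abstract, the following is a minimal sketch of an ICP for classification, not the authors' implementation. Several details are assumptions made for the example: the data are synthetic two-class Gaussians rather than Covtype, a hypothetical nearest-centroid softmax model stands in for the ANN, the conformity score is taken to be the probability assigned to a label, and the helper names (`make_data`, `fit_centroids`, `predict_proba`, `icp_prediction_sets`) are invented here. Training and calibration sets are kept disjoint.

```python
import numpy as np

def make_data(n, rng):
    # Synthetic two-class data (an assumption; the paper uses Covtype).
    labels = rng.integers(0, 2, size=n)
    features = rng.normal(size=(n, 2)) + 1.5 * labels[:, None]
    return features, labels

def fit_centroids(features, labels):
    # Stand-in "model": one centroid per class, fitted on the training set only.
    return np.stack([features[labels == k].mean(axis=0) for k in range(2)])

def predict_proba(centroids, features):
    # Softmax over negative squared distances to each class centroid.
    dist = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -dist
    logits -= logits.max(axis=1, keepdims=True)
    expd = np.exp(logits)
    return expd / expd.sum(axis=1, keepdims=True)

def icp_prediction_sets(cal_probs, cal_labels, test_probs, confidence):
    # Conformity score of a calibration example: probability of its true label.
    n_cal = len(cal_labels)
    cal_scores = np.sort(cal_probs[np.arange(n_cal), cal_labels])
    # For each test point and candidate label, count calibration scores that
    # are no larger than the candidate's score, then form the p-value.
    counts = np.searchsorted(cal_scores, test_probs, side="right")
    p_values = (counts + 1) / (n_cal + 1)
    # A label enters the prediction set if its p-value exceeds 1 - confidence.
    return p_values > 1 - confidence

rng = np.random.default_rng(0)
features, labels = make_data(2000, rng)

# Disjoint split of the development data: training, calibration, test.
x_train, y_train = features[:500], labels[:500]
x_cal, y_cal = features[500:1000], labels[500:1000]
x_test, y_test = features[1000:], labels[1000:]

centroids = fit_centroids(x_train, y_train)
sets = icp_prediction_sets(
    predict_proba(centroids, x_cal), y_cal,
    predict_proba(centroids, x_test), confidence=0.9,
)

# Coverage: fraction of test points whose true label is in the set.
coverage = sets[np.arange(len(y_test)), y_test].mean()
# Width: average number of labels per prediction set.
width = sets.sum(axis=1).mean()
print(f"coverage={coverage:.3f}, width={width:.3f}")
```

Under exchangeability, the coverage printed by this sketch should sit close to or above the chosen confidence level, up to sampling noise. Changing the slice boundaries is one way to explore how the sizes of the training and calibration sets trade off against each other.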

Paper Structure (12 sections, 9 equations, 8 figures)


Figures (8)

  • Figure 1: An example multilayer ANN with 1 hidden layer of 5 neurons [zhang2021dive].
  • Figure 2: Mean conformity score for each overlap size with 95% confidence interval in Experiment C.
  • Figure 3: Experiment A: diff (left), bias (middle), width (right) for different training set size (horizontal axes) and confidence levels, with 95% confidence intervals. Continued on next page.
  • Figure 4: Experiment A: diff (left), bias (middle), width (right) for different training set size (horizontal axes) and confidence levels, with 95% confidence intervals.
  • Figure 5: Experiment B: diff (left), bias (middle), width (right) for different combined calibration and training set size (horizontal axes) and confidence levels, with 95% confidence intervals. Continued on next page.
  • ...and 3 more figures