Test-time augmentation improves efficiency in conformal prediction

Divya Shanmugam, Helen Lu, Swami Sankaranarayanan, John Guttag

TL;DR

This work tackles the inefficiency of conformal prediction, where uncertainty sets can be overly large. It introduces test-time augmentation (TTA) into the conformal pipeline by learning an augmentation policy and an aggregation of TTA-transformed predictions, preserving exchangeability and the coverage guarantee. Empirical results across ImageNet, iNaturalist, and CUB-Birds show consistent reductions in average and class-conditional prediction-set sizes (up to 14% on average, with larger gains in harder classes) while maintaining or modestly improving coverage and adaptivity. The method is scalable, requires no model retraining, and demonstrates robustness under distribution shifts, offering a practical path toward more efficient conformal predictors in vision tasks.

Abstract

A conformal classifier produces a set of predicted classes and provides a probabilistic guarantee that the set includes the true class. Unfortunately, it is often the case that conformal classifiers produce uninformatively large sets. In this work, we show that test-time augmentation (TTA)--a technique that introduces inductive biases during inference--reduces the size of the sets produced by conformal classifiers. Our approach is flexible, computationally efficient, and effective. It can be combined with any conformal score, requires no model retraining, and reduces prediction set sizes by 10%-14% on average. We conduct an evaluation of the approach spanning three datasets, three models, two established conformal scoring methods, different guarantee strengths, and several distribution shifts to show when and why test-time augmentation is a useful addition to the conformal pipeline.
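As a rough illustration of where TTA enters the conformal pipeline, the following is a minimal sketch and not the authors' implementation. It assumes the classifier's softmax outputs for each augmented view are already available as an array, uses the simple score `1 - p(true class)` as a stand-in for the conformal scores evaluated in the paper, and takes the aggregation weights as given (uniform by default) rather than learning them. All function names are hypothetical.

```python
import numpy as np

def aggregate_tta(probs, weights=None):
    """Combine per-augmentation probabilities into one prediction per example.

    probs: array of shape (n_aug, n_examples, n_classes).
    weights: optional per-augmentation weights (assumption: supplied by the
    caller; the paper learns its aggregation, which is not reproduced here).
    """
    probs = np.asarray(probs, dtype=float)
    n_aug = probs.shape[0]
    if weights is None:
        weights = np.full(n_aug, 1.0 / n_aug)  # plain averaging
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Weighted sum over the augmentation axis -> (n_examples, n_classes)
    return np.tensordot(weights, probs, axes=1)

def calibrate_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal threshold using the score 1 - p(true class)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level, capped at 1
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_sets(test_probs, threshold):
    """Return, per example, the classes whose score is within the threshold."""
    return [np.flatnonzero(1.0 - p <= threshold) for p in test_probs]
```

The same aggregation is applied to calibration and test examples alike, so the calibration scores and test scores are computed by one fixed procedure. The threshold is then obtained exactly as it would be without TTA.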

Paper Structure

This paper contains 40 sections, 3 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: We illustrate the addition of test-time augmentation to conformal calibration in green (left) and provide a snapshot of the improvements it can confer (right). We show results on ImageNet, with a desired coverage of 95%, for the 20 classes with the largest predicted set sizes on average (computed over 10 calibration/test splits).
  • Figure 2: Robustness to distribution shift. We compare average prediction set size achieved by RAPS (yellow) to average prediction set size achieved when combining RAPS with TTA-Learned (green). Results reflect the distribution of average prediction set size across 10 runs using ImageNet and ResNet-50. We evaluate performance on different corruptions (x-axis) and different coverage guarantees (left, middle, right). RAPS+TTA-Learned (green) produces a noticeable reduction in prediction set size, even when subject to distribution shift, with no loss in coverage. Refer to Figure S1 in the supplement for a comparison of coverage achieved by both methods.
  • Figure 3: (A) Class-conditional prediction set sizes. We plot the distribution of class-conditional prediction set sizes, for ImageNet and ResNet-50 with $\alpha = 0.01$. RAPS+TTA-Learned (green) produces a noticeable reduction in class-conditional prediction set sizes. (B, C) Relationship between TTA improvements and original class set sizes and class difficulty. TTA introduces the largest improvements for classes with the largest original prediction set sizes (B) and classes on which the underlying classifier is often incorrect (C). Each point represents the average prediction set size for each class, across 10 splits.
  • Figure 4: (A) Effect of TTA-Learned on optimal Top-k: TTA-Learned significantly lowers the value of k required for Top-k prediction sets to achieve coverage on ImageNet and iNaturalist, but not on CUB-Birds. (B, C) Effect of TTA-Learned on rank of true class: TTA-Learned improves the rank of the true class among the sorted predicted probabilities for a given example for both ImageNet (B) and iNaturalist (C). We plot the rank using the original predicted probabilities compared to the TTA-transformed probabilities, binning all examples in the validation set into five equal-width bins. Dots that fall below the red line indicate that TTA improves the rank of the true class.
  • Figure S1: Impact on coverage. We plot achieved coverage for both RAPS and RAPS+TTA-Learned across several coverage guarantees and distribution shifts. As expected, distribution shift leads conformal predictors to not meet the coverage guarantee. In each case, the addition of TTA does not worsen coverage; in some cases (for example, given the contrast corruption and $\alpha = 0.05$, i.e., a 95% coverage guarantee) it even improves coverage.
  • ...and 6 more figures
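
The two quantities discussed in the Figure 4 caption, the rank of the true class and the value of k needed for Top-k sets to reach a target coverage, can be computed as in the sketch below. This is an illustrative assumption about how such quantities might be calculated, not the authors' evaluation code: function names are hypothetical, ties are resolved in favor of the true class, and the k returned is the empirical minimum on the data supplied, without any finite-sample correction.

```python
import numpy as np

def true_class_rank(probs, labels):
    """1-based rank of the true class when classes are sorted by
    descending predicted probability (ties favor the true class)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    true_p = probs[np.arange(len(labels)), labels]
    # Count classes with strictly higher probability than the true class
    return (probs > true_p[:, None]).sum(axis=1) + 1

def smallest_covering_k(probs, labels, alpha):
    """Smallest k such that Top-k sets contain the true class for at
    least a (1 - alpha) fraction of the given examples."""
    ranks = np.sort(true_class_rank(probs, labels))
    n = len(ranks)
    needed = int(np.ceil((1 - alpha) * n))  # examples that must be covered
    return int(ranks[needed - 1])
```

Running both functions once on the original probabilities and once on TTA-aggregated probabilities gives a direct before-and-after comparison of the kind the caption describes.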