Oops, I Sampled it Again: Reinterpreting Confidence Intervals in Few-Shot Learning

Raphael Lafargue, Luke Smith, Franck Vermet, Mathias Löwe, Ian Reid, Vincent Gripon, Jack Valmadre

TL;DR

Paired tests can partially address the issue of misleading confidence intervals (CIs) in few-shot learning (FSL) comparative studies. The width of the CI can be reduced further by strategically sampling tasks of a specific size.

Abstract

The predominant method for computing confidence intervals (CIs) in few-shot learning (FSL) is based on sampling the tasks with replacement, i.e. allowing the same samples to appear in multiple tasks. This makes the CI misleading in that it takes into account the randomness of the sampler but not the randomness of the data itself. To quantify the extent of this problem, we conduct a comparative analysis between CIs computed with and without replacement. This comparison reveals a notable underestimation by the predominant method. This observation calls for a reevaluation of how we interpret confidence intervals and the resulting conclusions in FSL comparative studies. Our research demonstrates that the use of paired tests can partially address this issue. Additionally, we explore methods to further reduce the width of the CI by strategically sampling tasks of a specific size. We also introduce a new optimized benchmark, which can be accessed at https://github.com/RafLaf/FSL-benchmark-again
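
To make the role of paired tests concrete, here is a minimal sketch in Python (NumPy only). It is not taken from the paper or its benchmark: the synthetic accuracies, the shared per-task difficulty term, the 0.01 gap between methods, and the helper names `mean_ci95` and `paired_diff_ci95` are all assumptions chosen for illustration. It compares two methods evaluated on the same tasks, first with one normal-approximation 95% CI per method, then with a single CI on the per-task accuracy differences. It does not address the with/without replacement question.

```python
import numpy as np


def mean_ci95(values):
    """Return (mean, half_width) of a normal-approximation 95% CI on the mean."""
    values = np.asarray(values, dtype=float)
    half_width = 1.96 * values.std(ddof=1) / np.sqrt(len(values))
    return values.mean(), half_width


def paired_diff_ci95(acc_a, acc_b):
    """95% CI on the mean per-task difference; both methods must share the tasks."""
    diff = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    return mean_ci95(diff)


# Hypothetical data: 600 tasks whose difficulty affects both methods equally.
rng = np.random.default_rng(0)
n_tasks = 600
task_difficulty = rng.normal(0.70, 0.10, size=n_tasks)  # shared per-task effect
acc_a = np.clip(task_difficulty + 0.01 + rng.normal(0, 0.02, n_tasks), 0, 1)
acc_b = np.clip(task_difficulty + rng.normal(0, 0.02, n_tasks), 0, 1)

mean_a, hw_a = mean_ci95(acc_a)  # unpaired CI for method A
mean_b, hw_b = mean_ci95(acc_b)  # unpaired CI for method B
mean_d, hw_d = paired_diff_ci95(acc_a, acc_b)  # paired CI on A - B

print(f"A: {mean_a:.4f} +/- {hw_a:.4f}")
print(f"B: {mean_b:.4f} +/- {hw_b:.4f}")
print(f"A - B (paired): {mean_d:.4f} +/- {hw_d:.4f}")
```

Because the task-level variation is common to both methods, it cancels in the differences, so the paired interval is much narrower than either per-method interval. How much narrower it is on real benchmarks depends on how strongly the two methods' task accuracies are correlated.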

Paper Structure (24 sections, 21 equations, 6 figures, 6 tables, 2 algorithms)

Figures (6)

  • Figure 1: Scatter plot of task accuracies using two different combinations of feature extractor/adaptation methods on the Traffic Signs benchmark.
  • Figure 2: Variance of the average accuracy vs. the number of queries with synthetic data. The two classes are represented as 1D Gaussians $\mathcal{N}(-1, 1)$ and $\mathcal{N}(1, 1)$. The size of the dataset is $N=1000$ (500 samples per class). Tasks are sampled according to Algorithm \ref{alg:no_repl}. The number of shots is set to 5. We fit the model described in Equation \ref{eq:model_main_text} to these measurements and observe close agreement between the model and the experiment.
  • Figure 3: Variance of the average accuracy $\bar{A}$ and number of tasks $T$ across different settings of $S$ and $N$, derived from synthetic datasets featuring two 1D Gaussian classes, $\mathcal{N}(-1, 1)$ and $\mathcal{N}(1, 1)$. The left pair of graphs displays results with a fixed number of shots ($S=5$), while the right pair of graphs shows results for a constant sample size in the synthetic dataset ($N=1000$).
  • Figure 4: (Left) Confidence interval ranges. (Right) Corresponding number of tasks generated. In all graphs the x-axis is the number of queries $Q$. These results represent averages from multiple trials, with the number of trials tailored according to the task count ($T$). Some curves stop before $Q$ reaches 100 because the number of samples per class is limited. We do not show Omniglot and Quickdraw for readability.
  • Figure 5: This figure shows that the value of $Q^*$ does not depend on the model used. This experiment is conducted using DINO v2 instead of CLIP. (Left) Confidence interval ranges. (Right) Corresponding number of tasks generated. In all graphs the x-axis is the number of queries $Q$. These results represent averages from multiple trials on DINO v2, with the number of trials tailored according to the task count ($T$). Some curves stop before $Q$ reaches 100 because the number of samples per class is limited. We do not show Omniglot and Quickdraw for readability.
  • ...and 1 more figure
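
The synthetic setup described in the Figure 2 caption can be sketched roughly as follows. The paper's own sampling algorithm and variance model are not reproduced on this page, so everything here is an assumption: the sampler simply partitions a shuffled dataset into disjoint tasks, the classifier is a nearest-class-mean rule, and the function names (`sample_tasks_without_replacement`, `task_accuracy`, `variance_of_average_accuracy`) are hypothetical. Only the two Gaussians, the 500 samples per class, and the 5 shots come from the caption.

```python
import numpy as np


def sample_tasks_without_replacement(x, y, shots, queries, rng):
    """Split a two-class dataset into disjoint tasks; each sample is used at most once."""
    per_class = shots + queries
    # Shuffle the indices of each class independently.
    idx = [rng.permutation(np.flatnonzero(y == c)) for c in (0, 1)]
    n_tasks = min(len(i) for i in idx) // per_class
    tasks = []
    for t in range(n_tasks):
        sl = slice(t * per_class, (t + 1) * per_class)
        chunk = [i[sl] for i in idx]
        support = [x[c[:shots]] for c in chunk]  # one array per class
        query = [x[c[shots:]] for c in chunk]
        tasks.append((support, query))
    return tasks


def task_accuracy(support, query):
    """Accuracy of a nearest-class-mean classifier on 1D features (an assumption)."""
    means = np.array([s.mean() for s in support])
    correct, total = 0, 0
    for label, q in enumerate(query):
        pred = np.abs(q[:, None] - means[None, :]).argmin(axis=1)
        correct += int((pred == label).sum())
        total += len(q)
    return correct / total


def variance_of_average_accuracy(queries, shots=5, n_per_class=500, trials=200, seed=0):
    """Estimate Var(average accuracy) by redrawing the whole dataset in each trial."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_per_class)
    averages = []
    for _ in range(trials):
        x = np.concatenate([rng.normal(-1, 1, n_per_class),
                            rng.normal(1, 1, n_per_class)])
        tasks = sample_tasks_without_replacement(x, y, shots, queries, rng)
        averages.append(np.mean([task_accuracy(s, q) for s, q in tasks]))
    return np.var(averages, ddof=1), len(tasks)
```

Calling `variance_of_average_accuracy(queries=q)` for a range of `q` values gives one variance estimate and one task count per setting. With a fixed dataset size, a larger number of queries per task leaves fewer disjoint tasks, which is the trade-off the figures explore.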