Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation

Andrei Jelea, Ahmed Nabil Belbachir, Marius Leordeanu

TL;DR

The paper presents Generalized Test-Time Augmentation (GTTA), a PCA-subspace perturbation technique that generates diverse, data-distribution-consistent test samples via Gaussian noise and averages their model outputs to improve performance across vision and non-vision tasks. It provides theoretical guarantees showing that GTTA can reduce the initial error and increase transformation diversity, while also removing structured noise through subspace decorrelation. A key innovation is a self-supervised distillation stage where the GTTA ensemble teaches a single model on unlabeled data, achieving ensemble-like accuracy with a single forward pass. The approach is validated on varied tasks (classification, segmentation, regression, speech) and challenging domains (underwater fish segmentation with the DeepSalmon dataset), highlighting GTTA’s generality, uncertainty-informed weighting of pseudo-labels, and practical test-time efficiency.

Abstract

We introduce Generalized Test-Time Augmentation (GTTA), a highly effective method for improving the performance of a trained model. Unlike existing Test-Time Augmentation approaches in the literature, it is general enough to be used off-the-shelf for many vision and non-vision tasks, such as classification, regression, image segmentation and object detection. By applying a new general data transformation that randomly perturbs, multiple times, the PCA subspace projection of a test input, GTTA creates valid augmented samples from the data distribution with high diversity, two properties that we show theoretically to be essential for a Test-Time Augmentation method to be effective. In contrast to existing methods, we also propose a final self-supervised learning stage in which the ensemble output, acting as an unsupervised teacher, is used to train the initial single student model, thus significantly reducing the test-time computational cost. Our comparisons to strong TTA approaches and SoTA models on various well-known vision and non-vision datasets and tasks, such as image classification and segmentation, pneumonia detection, speech recognition and house price prediction, validate the generality of the proposed GTTA. Furthermore, we also demonstrate its effectiveness on the more specific real-world task of salmon segmentation and detection in low-visibility underwater videos, for which we introduce DeepSalmon, the largest dataset of its kind in the literature.
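The transformation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`fit_pca`, `gtta_samples`, `gtta_predict`), the isotropic noise level `sigma`, the number of samples `n`, and the choice to reconstruct each sample from the perturbed subspace coordinates alone are all assumptions made for illustration.

```python
import numpy as np

def fit_pca(X, k):
    """Return the data mean and the top-k principal directions (as rows)."""
    mean = X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def gtta_samples(x, mean, components, sigma, n, rng):
    """Create n augmented copies of x by perturbing its PCA coordinates.

    Assumption: the same Gaussian noise level sigma is used for every
    component, and samples are rebuilt from the subspace coordinates only.
    """
    z = components @ (x - mean)                       # subspace projection, shape (k,)
    noise = rng.normal(0.0, sigma, size=(n, z.shape[0]))
    return (z + noise) @ components + mean            # back to input space, shape (n, d)

def gtta_predict(model, x, mean, components, sigma=0.1, n=32, seed=0):
    """Average the model outputs over the augmented samples.

    Also returns the standard deviation across candidates, which can be
    measured at test time without ground truth.
    """
    rng = np.random.default_rng(seed)
    xs = gtta_samples(x, mean, components, sigma, n, rng)
    outputs = np.stack([np.asarray(model(xi)) for xi in xs])
    return outputs.mean(axis=0), outputs.std(axis=0)
```

Here `model` is any callable that maps one flattened input vector to an output array, so the same sketch would apply to a classifier, a regressor, or a segmentation network wrapped to accept vectors.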

Paper Structure

This paper contains 22 sections, 3 theorems, 9 equations, 8 figures, 6 tables, 2 algorithms.

Key Result

Proposition 1

Let $\mathcal{T}$ be the set of GTTA transformations (augmentations) $\mathcal{T}_i \in \mathcal{T}, \; i \in \{1, \dots, n\}$, applied to an input sample $\mathbf{x} \sim D$ at test time and $\epsilon(f(\mathbf{x}))$ be the error with respect to ground truth for any output function $f$, given the input

Figures (8)

  • Figure 1: Relationship between the per-pixel standard deviation (a measure of variation, or inconsistency) among the ensemble candidate outputs and their mean absolute error with respect to ground truth, on the DeepSalmon test set, for multiple noise levels added to the input sample (using the first noise-adding strategy). The plot clearly shows, for all noise levels, that the higher the standard deviation in the outputs (which can always be measured at test time), the higher the true error (which is not known at test time). Conversely, the stronger the consensus among candidates, the better the output. Based on this observation, we use the standard deviation as an inverse measure of certainty, that is, of trust in the ensemble output, which is effective for self-supervised learning where the ensemble acts as a teacher for the initial single-model student.
  • Figure 2: Top $30$ eigenvalues of the sample covariance matrix over the DeepSalmon test set for the GTTA, color jittering and AugMix methods. The number of samples is $N = 100$. Note how the candidates produced by GTTA are the most uncorrelated and thus the most diverse. The inter-dependence among the candidates of the other TTA methods is due to the fewer degrees of freedom of their transformations, which automatically results in a less diverse population of candidates, in which structured noise has a better chance of surviving. For example, color jittering, which is defined by a few global parameters for the entire image, cannot destroy a specific shape in the background clutter, while GTTA, with its purely random noise in the class subspace, can. In our approach, we apply noise equally to each component.
  • Figure 3: Examples of augmented versions of a test image from the DeepSalmon dataset (shown in a), with a manually inserted structural distractor in the form of a circle, as produced by three different TTA methods: (b) color jittering, (c) AugMix, (d) GTTA. Note that only GTTA removes the added structured noise.
  • Figure 4: Evolution of the estimator bias, variance and error over the DeepSalmon validation set for different values of the noise level $\sigma$, when a constant (a) or incremental (b) std strategy is used for our GTTA method. Note how the bias first decreases with larger std. This indicates that a small amount of added noise is beneficial, as it can remove potentially harmful structured noise in the data. Once the amount of added noise exceeds a threshold, it starts destroying the good signal and structures in the data as well, namely those that are relevant for the given task and classes of interest. Also note that the variance is much smaller than the bias, and that it can always be reduced towards zero by increasing the number of generated GTTA input samples.
  • Figure 5: F-scores over blurred versions of test images from the COCO dataset for the initial Mask2Former model, GTTA and color jittering TTA, using different levels of blur. Note how the advantage of GTTA over color jittering increases as the image quality degrades.
  • ...and 3 more figures
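
The observation in Figure 1, that disagreement among ensemble candidates tracks the true error, suggests how pseudo-labels could be weighted during self-supervised distillation. The sketch below is one hypothetical way to do this. The exponential mapping from standard deviation to weight, the `temperature` parameter, and the squared-error loss are assumptions for illustration, not the weighting scheme used in the paper.

```python
import numpy as np

def ensemble_consensus(candidates):
    """Mean and standard deviation across n candidate outputs (first axis)."""
    candidates = np.asarray(candidates, dtype=float)
    return candidates.mean(axis=0), candidates.std(axis=0)

def certainty_weights(std, temperature=1.0):
    """Map per-element disagreement to a trust weight in (0, 1].

    Assumption: exp(-std / temperature); zero disagreement gives weight 1.
    """
    return np.exp(-np.asarray(std, dtype=float) / temperature)

def weighted_distillation_loss(student_out, teacher_mean, weights):
    """Weighted mean squared error between the student and the ensemble teacher."""
    sq_err = (np.asarray(student_out, dtype=float) - teacher_mean) ** 2
    return float((weights * sq_err).sum() / weights.sum())

# Two candidate outputs for a two-pixel "image".
teacher, std = ensemble_consensus([[0.0, 1.0], [0.0, 3.0]])
weights = certainty_weights(std)
loss = weighted_distillation_loss([1.0, 2.0], teacher, weights)
```

In this toy example the candidates agree on the first pixel and disagree on the second, so the first pixel receives the larger weight and dominates the student's loss.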

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • Proposition 3