Frozen Feature Augmentation for Few-Shot Image Classification

Andreas Bär, Neil Houlsby, Mostafa Dehghani, Manoj Kumar

TL;DR

This work shows that augmenting frozen vision-transformer features, rather than input images, with carefully designed frozen feature augmentations (FroFA) can improve few-shot image classification. By mapping features to a surrogate image-like space and applying per-channel and sequential augmentations, the authors identify brightness and other stylistic transformations as the most effective, while geometric changes often harm performance on ImageNet. The study reports consistent gains across multiple architectures and large pretraining datasets, especially on small transfer datasets, and finds that per-channel FroFA and sequential protocols can yield substantial boosts (up to ~7.7% in 1-shot). These results indicate that simple, computationally light feature-space augmentations are a practical way to improve few-shot transfer with frozen representations, and their generalization across architectures and pretraining setups suggests they apply broadly to data-efficient deployment of pretrained vision models.

Abstract

Training a linear classifier or lightweight model on top of pretrained vision model outputs, so-called 'frozen features', leads to impressive performance on a number of downstream few-shot tasks. Currently, frozen features are not modified during training. On the other hand, when networks are trained directly on images, data augmentation is a standard recipe that improves performance with no substantial overhead. In this paper, we conduct an extensive pilot study on few-shot image classification that explores applying data augmentations in the frozen feature space, dubbed 'frozen feature augmentation (FroFA)', covering twenty augmentations in total. Our study demonstrates that adopting a deceptively simple pointwise FroFA, such as brightness, can improve few-shot performance consistently across three network architectures, three large pretraining datasets, and eight transfer datasets.
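The pointwise brightness augmentation mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no formula, so every specific here is an assumption: the min-max mapping into [0, 1], the single uniform offset shared by all elements, the clipping, and the names `brightness_frofa` and `max_delta` are all hypothetical and not taken from the paper.

```python
import random


def brightness_frofa(features, max_delta=0.2, rng=None):
    """Brightness-style augmentation of one cached feature map (sketch).

    features: list of N token vectors, each a list of C floats.
    Assumed recipe (not from the source): min-max scale to [0, 1],
    add one random offset, clip, then map back to the original range.
    """
    rng = rng or random.Random()
    flat = [v for token in features for v in token]
    f_min, f_max = min(flat), max(flat)
    scale = (f_max - f_min) or 1.0  # guard against constant features

    # One offset shared by every element: a pointwise operation.
    delta = rng.uniform(-max_delta, max_delta)

    augmented = []
    for token in features:
        row = []
        for v in token:
            u = (v - f_min) / scale            # map to [0, 1]
            u = min(1.0, max(0.0, u + delta))  # shift brightness and clip
            row.append(u * scale + f_min)      # map back to feature range
        augmented.append(row)
    return augmented
```

In a training loop of this kind, the function would be called on each cached feature map at every step, before the features reach the lightweight head. The backbone is never rerun, which is why the overhead stays small.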

Paper Structure (33 sections, 8 equations, 6 figures, 10 tables)


Figures (6)

  • Figure 1: Few-shot results averaged across eight test sets, including ILSVRC-2012 [Deng2009, Russakovsky2015]. We use cached features from an L/16 model [Dosovitskiy2021] pretrained on JFT-3B [Zhai2022] (left) or WebLI [Chen2023] following a sigmoid language-image pretraining (SigLIP) [Zhai2023] (right). Our method, i.e., a multi-head attention pooling [Lee2019f] head trained with weight decay (MAPwd) and frozen feature augmentation (FroFA), shows significant gains across all shots with respect to a weight-decayed MAP, i.e., MAPwd, or an L2-regularized linear probe baseline, both without FroFA.
  • Figure 2: Pipeline for caching and training on (frozen) features. (a): Given a (frozen) pretrained vision transformer, with $L$ Transformer blocks (TBs), a multi-head attention pooling (MAP) layer, and a classification layer (CL), we select its $L$-th Transformer block for caching. (b): Next, we feed images $\boldsymbol{x}\in\mathcal{D}_{\boldsymbol{x}}$ to cache (frozen) features $\boldsymbol{f}\in\mathcal{D}_{\boldsymbol{f}}$. (c): Finally, we use $\mathcal{D}_{\boldsymbol{f}}$ to train a lightweight model on top. We investigate frozen feature augmentation (FroFA) $\boldsymbol{a}_{\boldsymbol{f}}\in\mathcal{A}_{\boldsymbol{f}}$ in this scenario.
  • Figure 3: Average top-1 accuracy for FroFA variants on our ILSVRC-2012 test set. We use the L/16 JFT-3B base setup (cf. the experimental results section). We sweep across a base sweep (cf. the training details section) to first find the best setting on our ILSVRC-2012 validation set for each FroFA operation point (cf. the augmentation section of the Appendix). Shaded areas indicate standard errors collected via sampling each shot five times.
  • Figure 4: Average top-1 accuracy of brightness c$^2$FroFA for JFT-3B and ImageNet-21k models on our ILSVRC-2012 test set trained on few-shotted ILSVRC-2012 training sets. Absolute gains over the weight-decayed MAP, i.e., MAPwd, and the L2-regularized linear probe baseline are reported. Depending on the setting, we sweep across a base sweep (cf. the training details section), a weight decay or L2 decay sweep (cf. the baselines section), and a brightness level sweep (cf. the augmentation section of the Appendix) to first find the best setting on our ILSVRC-2012 validation set for each model.
  • Figure 5: Average top-1 accuracy for patch dropout FroFA on our ILSVRC-2012 test set. We use the L/16 JFT-3B base setup (cf. the experimental results section). We sweep across a base sweep (cf. the training details section) to first find the best setting on our ILSVRC-2012 validation set for each number of patches (cf. the augmentation section of the Appendix). Shaded areas indicate standard errors collected via sampling each shot five times.
  • ...and 1 more figures
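
The patch dropout FroFA named in the Figure 5 caption, which is swept over a number of patches, can be sketched as follows. The caption does not define the operation, so this is only one plausible reading: keep a random subset of the cached patch tokens and discard the rest. The function name, the `num_keep` parameter, and the choice to preserve token order are hypothetical.

```python
import random


def patch_dropout_frofa(features, num_keep, rng=None):
    """Keep a random subset of patch tokens from one cached feature map.

    features: list of N token vectors, each a list of C floats.
    num_keep: number of patch tokens to retain (assumed sweep parameter).
    """
    rng = rng or random.Random()
    if not 1 <= num_keep <= len(features):
        raise ValueError("num_keep must be between 1 and the number of patches")

    # Sample indices without replacement; sorting preserves token order.
    keep = sorted(rng.sample(range(len(features)), num_keep))
    return [list(features[i]) for i in keep]
```

Because an attention-pooling head accepts a variable number of tokens, a shortened token sequence like this could be passed to it directly. Whether the paper handles the dropped tokens this way is not stated in the caption.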