GenFormer -- Generated Images are All You Need to Improve Robustness of Transformers on Small Datasets

Sven Oehri, Nikolas Ebert, Ahmed Abdullah, Didier Stricker, Oliver Wasenmüller

TL;DR

GenFormer introduces downstream-aware generative data augmentation to empower Vision Transformers on small datasets, addressing data scarcity and robustness gaps. By training a generator on real data to produce synthetic samples and mixing them with real data before downstream training, GenFormer leverages the transformer’s inherent robustness while mitigating overfitting. The approach is evaluated across Tiny ImageNet variants, CIFAR datasets, and domain-specific corpora (MedMNIST-C and EuroSAT-C), showing consistent gains in both accuracy and robustness, particularly for ViTs, and achieving state-of-the-art results when combined with standard augmentation and distillation techniques. The work also provides a thorough ablation of generative models and data-volume effects, highlighting diffusion models as preferred generators and demonstrating scalability to larger ViTs, thereby bridging the performance gap with CNNs in data-scarce contexts.

Abstract

Recent studies showcase the competitive accuracy of Vision Transformers (ViTs) in relation to Convolutional Neural Networks (CNNs), along with their remarkable robustness. However, ViTs demand a large amount of data to achieve adequate performance, which makes their application to small datasets challenging and leaves them behind CNNs in that setting. To overcome this, we propose GenFormer, a data augmentation strategy that uses generated images to improve transformer accuracy and robustness on small-scale image classification tasks. In our comprehensive evaluation we propose Tiny ImageNetV2, -R, and -A as new test set variants of Tiny ImageNet by transferring established ImageNet generalization and robustness benchmarks to the small-scale data domain. Similarly, we introduce MedMNIST-C and EuroSAT-C as corrupted test set variants of established fine-grained datasets in the medical and aerial domains. Through a series of experiments conducted on small datasets of various domains, including Tiny ImageNet, CIFAR, EuroSAT and MedMNIST datasets, we demonstrate the synergistic power of our method, in particular when combined with common train- and test-time augmentations, knowledge distillation, and architectural design choices. Additionally, we show the effectiveness of our approach under challenging conditions with limited training data, with significant improvements in both accuracy and robustness, bridging the gap between CNNs and ViTs in the small-scale dataset domain.

Paper Structure (19 sections, 16 figures, 7 tables)


Figures (16)

  • Figure 1: Comparison of the error rate (left) and mean corruption error (right) of DeiT [touvron2021training] on CIFAR [krizhevsky2009learning] and parts of the MedMNIST [medmnistv2] collection, with and without GenFormer. Lower error rates, closer to the plot center, are better.
  • Figure 2: The proposed GenFormer approach trains a downstream-aware image generation model, $\mathcal{G}_{\Theta}$, on real data $D_{real}$, then augments this dataset with generated data $D_{gen}$ to create $D_{mix}$. Subsequently, the classifier $\mathcal{C}_{\Theta}$ is trained on $D_{mix}$ for the classification task, optionally with methods such as data augmentation or knowledge distillation during training. $\oplus$ denotes concatenation.
  • Figure 3: Real (left) and generated (right) sample pairs of corresponding classes (from left to right): CIFAR-100 [krizhevsky2009learning]; BreastMNIST, PneumoniaMNIST and OrganSMNIST [medmnistv2]; and EuroSAT [helber2019eurosat].
  • Figure 4: Analysis of the influence of different amounts of training data on the accuracy and robustness of Vision Transformers [touvron2021training, wang2021pyramid]. The networks are trained with {5, 10, 20, 50, 100}% of Tiny ImageNet [le2015tiny]. We add 100,000 generated images (from a diffusion model trained on the same amount of real data) to each training set.
  • Figure 5: Analysis of the impact of training duration (blue line) versus the amount of training data (red line) for DeiT-Ti [touvron2021training] on CIFAR-10(-C) [krizhevsky2009learning, hendrycks2018benchmarking].
  • ...and 11 more figures
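
The data-mixing step described in the captions of Figures 2 and 4 can be illustrated with a minimal sketch. It covers only the dataset bookkeeping ($D_{mix} = D_{real} \oplus D_{gen}$, plus taking a fraction of the real data) and does not train a generator or classifier. The names `subsample` and `build_d_mix`, the `(identifier, label)` sample format, and the uniform (non-stratified) subsampling are assumptions made for illustration, not details taken from the paper.

```python
import random
from typing import List, Tuple

# Hypothetical sample format: (image identifier, class label).
Sample = Tuple[str, int]


def subsample(d_real: List[Sample], fraction: float, seed: int = 0) -> List[Sample]:
    """Keep a uniformly random fraction of the real data (e.g. 0.05 for 5%).

    Assumption: uniform sampling without replacement; the paper may use a
    different (e.g. class-balanced) scheme.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(fraction * len(d_real)))
    return random.Random(seed).sample(d_real, k)


def build_d_mix(d_real: List[Sample], d_gen: List[Sample], seed: int = 0) -> List[Sample]:
    """Form D_mix by concatenating D_real and D_gen, then shuffling."""
    d_mix = list(d_real) + list(d_gen)
    random.Random(seed).shuffle(d_mix)
    return d_mix


# Toy stand-ins for real and generated data (10 classes).
d_real = [(f"real_{i}", i % 10) for i in range(1000)]
d_gen = [(f"gen_{i}", i % 10) for i in range(500)]

# 20% of the real data plus all generated samples.
d_mix = build_d_mix(subsample(d_real, 0.2), d_gen)
print(len(d_mix))  # 700
```

In a real training setup the shuffled list would typically be replaced by a dataset wrapper and a shuffling data loader; the plain-list version here is only meant to make the concatenation explicit.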