Diminishing Returns in Self-Supervised Learning

Oli Bridge, Huey Sun, Botond Branyicskai-Nagy, Charles D'Ornano, Shomit Basu

TL;DR

This study interrogates how pre-training, intermediate fine-tuning, and downstream fine-tuning contribute to performance for a small Vision Transformer (ViNy) in semantic segmentation. Using SimMIM pre-training on ImageNet-1K, an auxiliary intermediate task on Intel Image Classification, and downstream evaluation on Oxford-IIIT Pets, the authors systematically vary pre-training data and downstream data sizes. They find that pre-training and downstream fine-tuning provide gains, especially with limited downstream data, but both exhibit diminishing returns; in contrast, intermediate fine-tuning consistently harms performance, highlighting potential misalignment between intermediate tasks and the final objective. The work suggests that for small models, careful data selection and avoiding poorly aligned auxiliary tasks yield better efficiency and performance, guiding practical deployment in low-resource settings.

Abstract

While transformer-based architectures have taken computer vision and NLP by storm, they often require a vast number of parameters and large amounts of training data to attain strong performance. In this work, we experiment with three distinct pre-training, intermediate fine-tuning, and downstream datasets and training objectives to explore their marginal benefits on a small 5M-parameter vision transformer. We find that pre-training and fine-tuning always help our model but have diminishing returns, whereas intermediate fine-tuning can actually harm downstream performance, potentially due to dissimilarity in task mechanics. Taken together, our results suggest that small-scale ViTs benefit most from targeted pre-training and careful data selection, while indiscriminate stacking of intermediate tasks can waste compute and even degrade performance.

Paper Structure

This paper contains 13 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Comparing ViNy mIoU% across pre-training and fine-tuning data size (3 runs)
  • Figure 2: Plotting the mean and standard error of the difference in mIoU% with and without intermediate fine-tuning across 3 runs.
  • Figure A1: A semantic segmentation example on the Oxford-IIIT Pets dataset. The left image is the input, the center image is the ground truth, and the right image is our model's prediction.