Diminishing Returns in Self-Supervised Learning
Oli Bridge, Huey Sun, Botond Branyicskai-Nagy, Charles D'Ornano, Shomit Basu
TL;DR
This study examines how pre-training, intermediate fine-tuning, and downstream fine-tuning each contribute to the semantic segmentation performance of a small Vision Transformer (ViNy). Using SimMIM pre-training on ImageNet-1K, an auxiliary intermediate task on Intel Image Classification, and downstream evaluation on Oxford-IIIT Pets, the authors systematically vary the sizes of the pre-training and downstream datasets. They find that pre-training and downstream fine-tuning both provide gains, especially when downstream data is limited, but both exhibit diminishing returns. In contrast, intermediate fine-tuning consistently harms performance, which points to a potential misalignment between the intermediate task and the final objective. The work suggests that, for small models, careful data selection and avoiding poorly aligned auxiliary tasks yield better efficiency and performance, which can guide practical deployment in low-resource settings.
Abstract
While transformer-based architectures have taken computer vision and NLP by storm, they often require a vast number of parameters and a large amount of training data to attain strong performance. In this work, we experiment with three distinct training stages (pre-training, intermediate fine-tuning, and downstream fine-tuning), each with its own dataset and training objective, to explore their marginal benefits on a small 5M-parameter vision transformer. We find that pre-training and fine-tuning always help our model but have diminishing returns, whereas intermediate fine-tuning can actually harm downstream performance, potentially due to dissimilarity in task mechanics. Taken together, our results suggest that small-scale ViTs benefit most from targeted pre-training and careful data selection, while indiscriminate stacking of intermediate tasks can waste compute and even degrade performance.
