Escaping The Big Data Paradigm in Self-Supervised Representation Learning

Carlos Vélez García, Miguel Cazorla, Jorge Pomares

TL;DR

This work tackles the data efficiency challenge in self-supervised vision by introducing SCOTT, a Sparse Convolutional Tokenizer that injects CNN-like inductive biases into Vision Transformers, and MIM-JEPA, a Joint-Embedding Predictive Architecture implemented within a Masked Image Modeling framework. SCOTT enables ViTs to operate effectively in small-data regimes, while MIM-JEPA learns meaningful latent representations by predicting masked patch encodings from a target encoder with EMA updating, minimizing a Smooth-L1 loss on masked regions. The authors validate the approach on three small, fine-grained, standard-resolution datasets (Flowers-102, Pets-37, ImageNet-100), showing that frozen SCOTT models pretrained with MIM-JEPA outperform fully supervised baselines and achieve competitive results with state-of-the-art methods that rely on large-scale pretraining. This demonstrates that robust, transferable representations can be learned with limited data, compute, and model size, facilitating applications in data-constrained domains like medical imaging and robotics. The work broadens access to powerful SSL in computer vision by reducing its reliance on vast datasets and heavy computational resources, while maintaining strong performance.
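The training signal summarized above (Smooth-L1 on masked patches in latent space, with an EMA-updated target encoder) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `beta` threshold, the `momentum` value of 0.996, and the toy tensors standing in for encoder and predictor outputs are all assumptions made for illustration.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Elementwise Smooth-L1: quadratic below beta, linear above (beta is assumed)."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)

def masked_latent_loss(pred, target, mask):
    """Mean Smooth-L1 over masked patches only.

    pred, target: (num_patches, dim) latent patch embeddings.
    mask: (num_patches,) boolean array, True where the patch was masked.
    """
    per_patch = smooth_l1(pred, target).mean(axis=-1)
    return per_patch[mask].mean()

def ema_update(target_params, context_params, momentum=0.996):
    """Move target-encoder weights toward the context-encoder weights.

    The momentum value is an assumed, commonly used default.
    """
    return {
        name: momentum * target_params[name] + (1.0 - momentum) * context_params[name]
        for name in target_params
    }

# Toy stand-ins for real network outputs (hypothetical shapes).
rng = np.random.default_rng(0)
num_patches, dim = 16, 8
s_y = rng.normal(size=(num_patches, dim))             # target-encoder latents
s_y_hat = s_y + 0.1 * rng.normal(size=s_y.shape)      # predictor output
mask = np.zeros(num_patches, dtype=bool)
mask[rng.choice(num_patches, size=6, replace=False)] = True

loss = masked_latent_loss(s_y_hat, s_y, mask)

# One EMA step on a toy parameter dictionary.
target_w = {"w": np.ones(3)}
context_w = {"w": np.zeros(3)}
target_w = ema_update(target_w, context_w, momentum=0.9)
```

In a real training loop the loss gradient would flow only into the context encoder and predictor, while the target encoder would change solely through `ema_update`.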

Abstract

The reliance on large-scale datasets and extensive computational resources has become a major barrier to advancing representation learning in vision, especially in data-scarce domains. In this paper, we address the critical question: Can we escape the big data paradigm in self-supervised representation learning from images? We introduce SCOTT (Sparse Convolutional Tokenizer for Transformers), a shallow tokenization architecture that is compatible with Masked Image Modeling (MIM) tasks. SCOTT injects convolutional inductive biases into Vision Transformers (ViTs), enhancing their efficacy in small-scale data regimes. Alongside it, we propose training with a Joint-Embedding Predictive Architecture within a MIM framework (MIM-JEPA), which operates in latent representation space to capture more semantic features. Our approach enables ViTs to be trained from scratch on datasets orders of magnitude smaller than traditionally required, without relying on massive external datasets for pretraining. We validate our method on three small-size, standard-resolution, fine-grained datasets: Oxford Flowers-102, Oxford IIIT Pets-37, and ImageNet-100. Despite the challenges of limited data and high intra-class similarity, frozen SCOTT models pretrained with MIM-JEPA significantly outperform fully supervised methods and achieve competitive results with SOTA approaches that rely on large-scale pretraining, complex image augmentations, and bigger model sizes. By demonstrating that robust off-the-shelf representations can be learned with limited data, compute, and model sizes, our work paves the way for computer applications in resource-constrained environments such as medical imaging or robotics. Our findings challenge the prevailing notion that vast amounts of data are indispensable for effective representation learning in vision, offering a new pathway toward more accessible and inclusive advancements in the field.

Paper Structure

This paper contains 25 sections, 2 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Matching different semantic parts across categories and poses. We show the first 3 components of a PCA computed over the token embeddings of images from the same column (a, b, and c). The background is removed by thresholding the first component. Notably, semantically similar parts are matched by color despite belonging to different object classes and poses. For instance, in (a) animal claws are purple and the torso is pink; in (b) wings are green and the torso is red. Interestingly, once the background is removed in (c), different flower disks are matched to different colors.
  • Figure 2: MIM-JEPA. An image $I_{full}$ is processed by the target encoder $f_{\bar{\theta}}$ to produce a latent patch-level representation $s^y$, whose masked patches $M$ are used as targets. The context image $I_{masked}$, generated from the complement of $M$, is input to the context encoder $f_{\theta}$ to produce $s^x$. The predictor $f_\phi$ is fed with $s^x$ to predict the missing content $\hat{s}^y$. The Smooth-L1 loss is computed only on the (black) masked patches in latent space to update the context-encoder and predictor weights (dashed line), while the target encoder's weights are updated via an exponential moving average (EMA) of the context encoder's weights (dotted line).
  • Figure 3: Visualization of the first PCA components. We compute a PCA over the patches from all images in the first row. A semantic class segmentation emerges in pink, and the background is removed by thresholding the first component. A second PCA over the remaining object patches reveals different object parts: the head in purple, the torso in yellow, or the wings in red. Similar to Figure 1 (c), the two rightmost columns segment several ducks, potentially enabling object counting.
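
The two-stage PCA visualization described in the Figure 1 and Figure 3 captions can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: `pca_project`, `foreground_rgb`, the zero threshold, and the synthetic token embeddings are hypothetical. The sign of a principal component is arbitrary, so which side of the threshold counts as foreground has to be checked by eye in practice.

```python
import numpy as np

def pca_project(x, k):
    """Project the rows of x onto their first k principal components."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def foreground_rgb(tokens, threshold=0.0):
    """Two-stage PCA over patch tokens of shape (num_patches, dim).

    Stage 1: threshold the first component to split background from object
    (the threshold and the choice of the positive side are assumptions).
    Stage 2: run a second PCA on the remaining patches and rescale the first
    three components to [0, 1] so they can be displayed as RGB colors.
    """
    first = pca_project(tokens, 1)[:, 0]
    foreground = first > threshold
    comps = pca_project(tokens[foreground], 3)
    lo, hi = comps.min(axis=0), comps.max(axis=0)
    rgb = (comps - lo) / np.maximum(hi - lo, 1e-8)
    return foreground, rgb

# Synthetic stand-in for token embeddings: two well-separated groups.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=-5.0, size=(10, 8))
group_b = rng.normal(loc=5.0, size=(10, 8))
tokens = np.concatenate([group_a, group_b], axis=0)

foreground, rgb = foreground_rgb(tokens)
```

With real embeddings, `tokens` would hold the patch tokens of all images being compared, so that the same color corresponds to the same component across images.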