MV-MR: multi-views and multi-representations for self-supervised learning and knowledge distillation

Vitaliy Kinakh, Mariia Drozdova, Slava Voloshynovskiy

TL;DR

MV-MR is a dependence-based regularization framework for self-supervised learning (SSL). It maximizes the dependence between embeddings of augmented and non-augmented views, and between augmented embeddings and multiple hand-crafted representations. By combining an MI upper-bound inspired loss with distance-covariance losses, MV-MR achieves collapse-free, non-contrastive SSL and enables model-agnostic knowledge distillation from strong teachers such as CLIP. The method delivers state-of-the-art results on STL10 and CIFAR-derived datasets among non-contrastive, clustering-free SSL methods and shows competitive performance on ImageNet-1K in linear and semi-supervised settings. Hand-crafted features give the framework a flexible way to inject invariant data representations into SSL, and the same mechanism supports distillation to lightweight models.

Abstract

We present a new method of self-supervised learning and knowledge distillation based on multi-views and multi-representations (MV-MR). MV-MR maximizes the dependence between learnable embeddings from augmented and non-augmented views, jointly with the dependence between learnable embeddings from the augmented view and multiple non-learnable representations from the non-augmented view. We show that the proposed method can be used for efficient self-supervised classification and model-agnostic knowledge distillation. Unlike other self-supervised techniques, our approach does not use any contrastive learning, clustering, or stop gradients. MV-MR is a generic framework that allows constraints to be imposed on the learnable embeddings by using image multi-representations as regularizers. In this view, knowledge distillation is a particular case of such regularization. MV-MR provides state-of-the-art performance on the STL10 and ImageNet-1K datasets among non-contrastive and clustering-free methods. We show that a lower-complexity ResNet50 model pretrained using the proposed knowledge distillation from the CLIP ViT model achieves state-of-the-art performance on STL10 linear evaluation. The code is available at: https://github.com/vkinakh/mv-mr

Paper Structure (22 sections, 12 equations, 2 figures, 9 tables)

Figures (2)

  • Figure 1: MV-MR: proposed SSL approach. Two views of the image are produced: the original one and one augmented by $q_{\phi_{\mathrm{t}}}(\tilde{\mathbf{x}}| \mathbf{x})$. The original view is encoded by the encoder $q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{x})$, producing the original embedding ${\bf z}_i$, and the augmented view is encoded by $q_{\phi_{\mathbf{z}}}(\tilde{\mathbf{z}}|\tilde{\mathbf{x}})$, producing the augmented embedding $\tilde{\mathbf{z}}_{i}$. The representations $\mathbf{z}_{i, k}^{*}$ are obtained via $K$ hand-crafted feature extraction mappers $q_{\phi_{z_{k}^{*}}}\left(\mathbf{z}_{k}^{*} | \mathbf{x}\right), 1\leq k \leq K$. The same process is applied to each image ${\bf x}_i$ in the batch, $1 \leq i \leq B$. The loss $\mathcal{L}_1(\phi_{\mathrm{z}})$ regularizes the embedding by minimizing the Euclidean distances between the embeddings ${\bf z}_i$ and $\tilde{\mathbf{z}}_{i}$ while ensuring that their variance stays above a threshold. The loss $\mathcal{L}_2(\phi_{\mathrm{z}})$ ensures the dependence between the pair of augmented and non-augmented embeddings using the distance correlation. The regularization loss $\mathcal{L}_3(\phi_{\mathrm{z}})$ maximizes the distance correlation between the augmented embedding $\tilde{\mathbf{z}}_{i}$ and the set of hand-crafted features $\mathbf{z}_{i, k}^{*}, 1 \leq k \leq K$, computed over the given batch of size $B$.
  • Figure 2: MV-MR: distillation approach. $q_{\phi_{z^{*}}}(\textbf{z}^{*}|\textbf{x})$ is the high-complexity (in terms of parameters) teacher model, used as a feature extractor to train a low-complexity student model $q_{\phi_{z}}\left(\mathbf{z} | \mathbf{x}\right)$. The teacher model takes the place of the set of hand-crafted feature extractors in Figure 1. The representations $\mathbf{z}_{i}^{*}$ are obtained from the pretrained teacher model $q_{\phi_{z^{*}}}(\textbf{z}^{*}|\textbf{x})$. The same losses as in self-supervised pretraining are used: $\mathcal{L}_1(\phi_{\mathrm{z}})$ minimizes the Euclidean distances between the embeddings ${\bf z}_i$ and $\tilde{\mathbf{z}}_{i}$ while ensuring that their variance stays above a threshold, $\mathcal{L}_2(\phi_{\mathrm{z}})$ ensures the dependence between the pair of augmented and non-augmented embeddings using the distance correlation, and $\mathcal{L}_3(\phi_{\mathrm{z}})$ maximizes the distance correlation between the augmented embedding $\tilde{\mathbf{z}}_{i}$ and the teacher's embeddings $\mathbf{z}_{i}^{*}$.
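
The three loss terms named in the figure captions can be illustrated with a minimal NumPy sketch. It is not taken from the authors' repository. The function names, the hinge form of the variance term, the threshold `gamma`, and the sign convention (negative distance correlation as a loss) are all assumptions made for illustration. The distance correlation itself follows the standard sample (V-statistic) definition based on double-centered pairwise distance matrices.

```python
import numpy as np

def pairwise_dist(a):
    # Euclidean distance matrix between the rows of a (shape: batch x dim).
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def double_center(d):
    # Subtract row and column means, then add back the grand mean.
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(a, b, eps=1e-12):
    # Sample distance correlation between two batches with the same number
    # of rows; the feature dimensions of a and b may differ.
    A = double_center(pairwise_dist(a))
    B = double_center(pairwise_dist(b))
    dcov2_ab = (A * B).mean()
    dvar2_a = (A * A).mean()
    dvar2_b = (B * B).mean()
    return float(np.sqrt(max(dcov2_ab, 0.0) / (np.sqrt(dvar2_a * dvar2_b) + eps)))

def variance_hinge(z, gamma=1.0, eps=1e-4):
    # Assumed form: penalize dimensions whose batch std falls below gamma.
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.maximum(0.0, gamma - std).mean())

def mvmr_loss_terms(z, z_tilde, representations):
    # z, z_tilde: original and augmented embeddings, shape (B, D).
    # representations: list of K arrays of shape (B, D_k), standing in for
    # hand-crafted features or, in distillation, teacher embeddings.
    l1 = float(((z - z_tilde) ** 2).sum(axis=1).mean())
    l1 += variance_hinge(z) + variance_hinge(z_tilde)
    l2 = -distance_correlation(z, z_tilde)
    l3 = -sum(distance_correlation(z_tilde, r) for r in representations)
    return l1, l2, l3
```

Because distance correlation is computed from pairwise distances within the batch, the embedding and the reference representation do not need to share a dimensionality. This is what lets the same term serve both hand-crafted features and a teacher such as CLIP. How the three terms are weighted against each other is not specified in the captions above, so the sketch returns them separately.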