Variational Self-Supervised Learning

Mehmet Can Yavuz, Berrin Yanikoglu

TL;DR

This paper addresses the overhead of decoder-based reconstruction in latent representation learning by introducing Variational Self-Supervised Learning (VSSL), a decoder-free framework that pairs two encoders with Gaussian latent posteriors and uses a momentum-updated teacher to define a data-dependent prior. It replaces the conventional reconstruction term in the ELBO with a cross-view denoising objective, yielding a self-supervised ELBO that remains analytically tractable under the Gaussian assumptions, and it adds cosine-based KL and log-likelihood formulations to improve latent-space alignment. Training uses a symmetric, EMA-based teacher-student setup without generative reconstruction, which keeps the method scalable and probabilistically grounded. On CIFAR-10, CIFAR-100, and ImageNet-100, VSSL matches or exceeds leading SSL methods (e.g., BYOL, MoCo V3) under both online and offline evaluation protocols, suggesting it can bridge variational modeling and modern SSL.

Abstract

We present Variational Self-Supervised Learning (VSSL), a novel framework that combines variational inference with self-supervised learning to enable efficient, decoder-free representation learning. Unlike traditional VAEs that rely on input reconstruction via a decoder, VSSL symmetrically couples two encoders with Gaussian outputs. A momentum-updated teacher network defines a dynamic, data-dependent prior, while the student encoder produces an approximate posterior from augmented views. The reconstruction term in the ELBO is replaced with a cross-view denoising objective, preserving the analytical tractability of Gaussian KL divergence. We further introduce cosine-based formulations of KL and log-likelihood terms to enhance semantic alignment in high-dimensional latent spaces. Experiments on CIFAR-10, CIFAR-100, and ImageNet-100 show that VSSL achieves competitive or superior performance to leading self-supervised methods, including BYOL and MoCo V3. VSSL offers a scalable, probabilistically grounded approach to learning transferable representations without generative reconstruction, bridging the gap between variational modeling and modern self-supervised techniques.
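
The analytical tractability mentioned above comes from the closed-form KL divergence between two Gaussians. The following is a minimal sketch, assuming diagonal covariances and a $\log\sigma$ parameterization; the function name `gaussian_kl` and its argument names are hypothetical and not taken from the paper. It treats the student output as the posterior $q$ and the teacher output as the prior $p$. The cosine-based variants are not shown, since their exact form is not given in this summary.

```python
import math


def gaussian_kl(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """KL(q || p) between two diagonal Gaussians.

    Each argument is a sequence of floats with one entry per latent dimension.
    q plays the role of the student posterior and p the teacher prior
    (an assumption made for this illustration).
    """
    total = 0.0
    for mq, lsq, mp, lsp in zip(mu_q, log_sigma_q, mu_p, log_sigma_p):
        var_q = math.exp(2.0 * lsq)  # sigma_q squared
        var_p = math.exp(2.0 * lsp)  # sigma_p squared
        # Per-dimension term:
        # log(sigma_p / sigma_q) + (sigma_q^2 + (mu_q - mu_p)^2) / (2 sigma_p^2) - 1/2
        total += (lsp - lsq) + (var_q + (mq - mp) ** 2) / (2.0 * var_p) - 0.5
    return total


if __name__ == "__main__":
    # Identical distributions have zero divergence.
    print(gaussian_kl([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]))
    # Unit-variance Gaussians whose means differ by 1 in one dimension.
    print(gaussian_kl([1.0], [0.0], [0.0], [0.0]))
```

Because the prior here is another network's output rather than a fixed standard normal, the divergence depends on both sets of means and variances, which is what makes the prior data-dependent.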

Paper Structure

This paper contains 13 sections, 12 equations, 2 figures, 1 table, 1 algorithm.

Figures (2)

  • Figure 1: Directed graphical model for the VSSL framework. Observations $x_1$ and $x_2$ are encoded via parameterized inference networks $\theta_t$ and $\theta_s$, producing the latent representation $z$. The student path $q_{\theta_s}(z \mid x_2)$ updates the teacher path $q_{\theta_t}(z \mid x_1)$ through exponential moving average (EMA). A denoising network $p_\phi(\hat{z} \mid z)$ refines the latent, producing a denoised representation $\hat{z}$ for self-supervised learning objectives.
  • Figure 2: Overview of the VSSL framework for unsupervised representation learning. An original image is augmented into two views, $t_i$ and $t_j$, which are processed by a momentum-updated teacher network and a student network, respectively. Both networks encode their view into feature vectors and variational distributions parameterized by $\mu$ and $\log\sigma$. The teacher outputs serve as a prior for the student’s posterior via KL divergence minimization. Gaussian sampling from the student posterior allows further processing through an autoencoder, enforcing consistency and regularization in the learned latent space.
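
The two mechanisms named in the captions, the EMA teacher update and Gaussian sampling from the student posterior, can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: parameters are flattened into plain lists of floats, the names `ema_update` and `sample_latent` are hypothetical, and the momentum value of 0.99 is an assumed default, not a value reported here.

```python
import math
import random


def ema_update(teacher, student, momentum=0.99):
    """Return updated teacher parameters.

    new_teacher = momentum * teacher + (1 - momentum) * student
    The momentum default is an assumption chosen for illustration.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def sample_latent(mu, log_sigma, rng):
    """Reparameterized draw z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]


if __name__ == "__main__":
    rng = random.Random(0)
    teacher = [1.0, 0.0]
    student = [0.0, 1.0]
    # The teacher drifts slowly toward the student.
    teacher = ema_update(teacher, student, momentum=0.9)
    print(teacher)
    # One latent sample from a student posterior with mean 0 and sigma 1.
    print(sample_latent([0.0, 0.0], [0.0, 0.0], rng))
```

In this scheme the teacher receives no gradient updates of its own and only tracks a smoothed copy of the student.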