Variational Self-Supervised Learning
Mehmet Can Yavuz, Berrin Yanikoglu
TL;DR
This paper addresses the computational overhead that decoders add to latent representation learning by introducing Variational Self-Supervised Learning (VSSL), a decoder-free framework that symmetrically couples two encoders with Gaussian latent outputs: a student that produces an approximate posterior from augmented views, and a momentum-updated (EMA) teacher that defines a data-dependent prior. The reconstruction term of the ELBO is replaced with a cross-view denoising objective, and the resulting self-supervised ELBO remains analytically tractable because the KL divergence between Gaussians has a closed form. Cosine-based formulations of the KL and log-likelihood terms are added to improve alignment in the latent space. Training requires no generative reconstruction, which keeps the method scalable while retaining a probabilistic grounding. Empirical results on CIFAR-10, CIFAR-100, and ImageNet-100 show that VSSL matches or exceeds state-of-the-art SSL methods (e.g., BYOL, MoCo V3) under both online and offline evaluation protocols, suggesting that it can bridge variational modeling and modern SSL.
Abstract
We present Variational Self-Supervised Learning (VSSL), a novel framework that combines variational inference with self-supervised learning to enable efficient, decoder-free representation learning. Unlike traditional VAEs, which rely on input reconstruction via a decoder, VSSL symmetrically couples two encoders with Gaussian outputs. A momentum-updated teacher network defines a dynamic, data-dependent prior, while the student encoder produces an approximate posterior from augmented views. The reconstruction term in the ELBO is replaced with a cross-view denoising objective, preserving the analytical tractability of the Gaussian KL divergence. We further introduce cosine-based formulations of the KL and log-likelihood terms to enhance semantic alignment in high-dimensional latent spaces. Experiments on CIFAR-10, CIFAR-100, and ImageNet-100 show that VSSL achieves performance competitive with or superior to that of leading self-supervised methods, including BYOL and MoCo V3. VSSL offers a scalable, probabilistically grounded approach to learning transferable representations without generative reconstruction, bridging the gap between variational modeling and modern self-supervised techniques.
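To make two of the ingredients named above concrete, the following is a minimal sketch, not the authors' implementation. It shows (a) the standard closed-form KL divergence between two diagonal Gaussians, which is what keeps a Gaussian student posterior and Gaussian teacher prior analytically tractable, and (b) a generic exponential-moving-average (momentum) update of teacher parameters from student parameters. The function names, the diagonal-covariance assumption, the KL direction (posterior relative to prior, as in a conventional ELBO), and the momentum value are all illustrative assumptions. The cosine-based KL and log-likelihood terms are not reproduced here, since their exact form is not given in this summary.

```python
import math
from typing import List, Sequence


def diag_gaussian_kl(
    mu_q: Sequence[float],
    var_q: Sequence[float],
    mu_p: Sequence[float],
    var_p: Sequence[float],
) -> float:
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) in nats.

    Assumption for illustration: q plays the role of the student posterior
    and p the teacher-defined prior, both with diagonal covariance.
    """
    if not (len(mu_q) == len(var_q) == len(mu_p) == len(var_p)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        # Per-dimension term: log(vp/vq) + (vq + (mq - mp)^2) / vp - 1
        total += math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
    return 0.5 * total


def ema_update(
    teacher: Sequence[float],
    student: Sequence[float],
    momentum: float = 0.99,  # hypothetical value, not taken from the paper
) -> List[float]:
    """Generic momentum (EMA) update: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Hypothetical 2-D latent statistics for one augmented view.
student_mu, student_var = [1.0, 0.0], [1.0, 1.0]
teacher_mu, teacher_var = [0.0, 0.0], [1.0, 1.0]

kl = diag_gaussian_kl(student_mu, student_var, teacher_mu, teacher_var)
new_teacher = ema_update(teacher=[1.0, 2.0], student=[0.0, 0.0], momentum=0.9)
```

In a decoder-free setup of this kind, the KL term would be evaluated between the student's output for one view and the teacher's output for another, and the teacher would receive no gradient, being updated only through the moving average. Those details are assumptions based on common teacher-student practice rather than statements about VSSL itself.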
