Joint Embedding Variational Bayes
Amin Oji, Paul Fieguth
TL;DR
VJE introduces a normalized probabilistic formulation for non-contrastive self-supervised learning by positing a latent-variable model over encoder embeddings and optimizing a symmetric conditional ELBO. The likelihood is factorized into directional and radial components via a polar decomposition and is modeled with a heavy-tailed Student-$t$ distribution, with feature-wise uncertainty captured through a shared diag($\boldsymbol{\nu}$) variance that ties the posterior and likelihood. An asymmetric encoder–inference/target setup with stop-gradient enables fixed-observation conditioning, yielding non-degenerate posteriors and enabling density-based anomaly scoring. Empirically, VJE achieves competitive representation quality on ImageNet-1K and CIFAR/STL while providing coherent probabilistic semantics, demonstrated by strong one-class anomaly detection performance on CIFAR-10 and robust ablations. This work offers a principled alternative to energy-based, pointwise non-contrastive objectives by grounding representation learning in normalized probabilistic modelling and uncertainty quantification.
Abstract
We introduce Variational Joint Embedding (VJE), a framework that synthesizes joint embedding and variational inference to enable self-supervised learning of probabilistic representations in a reconstruction-free, non-contrastive setting. Compared to energy-based predictive objectives that optimize pointwise discrepancies, VJE maximizes a symmetric conditional evidence lower bound (ELBO) for a latent-variable model defined directly on encoder embeddings. We instantiate the conditional likelihood with a heavy-tailed Student-$t$ model using a polar decomposition that explicitly decouples directional and radial factors to prevent norm-induced instabilities during training. VJE employs an amortized inference network to parameterize a diagonal Gaussian variational posterior whose feature-wise variances are shared with the likelihood scale to capture anisotropic uncertainty without auxiliary projection heads. Across ImageNet-1K, CIFAR-10/100, and STL-10, VJE achieves performance comparable to standard non-contrastive baselines under linear and k-NN evaluation. We further validate these probabilistic semantics through one-class CIFAR-10 anomaly detection, where likelihood-based scoring under the proposed model outperforms comparable self-supervised baselines.
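As a rough illustration of the ingredients named above, the following NumPy sketch shows a polar decomposition of embeddings into a radius and a unit direction, a univariate location-scale Student-$t$ log-density, and a reparameterized sample from a diagonal Gaussian. It is not the authors' implementation: all function names, shapes, and the toy inputs are assumptions made for this example, and the abstract does not specify how the radial and directional factors are actually scored or combined in the ELBO.

```python
import math
import numpy as np

def polar_decompose(z, eps=1e-12):
    """Split embeddings z (batch, dim) into radius r (batch, 1) and unit direction u (batch, dim)."""
    r = np.linalg.norm(z, axis=-1, keepdims=True)
    u = z / np.maximum(r, eps)  # eps guards against division by zero for all-zero rows
    return r, u

def student_t_logpdf(x, loc, scale, df):
    """Elementwise log-density of a univariate location-scale Student-t distribution."""
    t = (x - loc) / scale
    log_norm = (math.lgamma((df + 1.0) / 2.0) - math.lgamma(df / 2.0)
                - 0.5 * math.log(df * math.pi))
    return log_norm - np.log(scale) - 0.5 * (df + 1.0) * np.log1p(t * t / df)

def sample_diag_gaussian(mu, var, rng):
    """Reparameterized sample from N(mu, diag(var)); var holds feature-wise variances."""
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)

# Toy usage with random stand-ins for encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8))           # pretend embeddings
r, u = polar_decompose(z)                 # radial and directional factors
var = np.full(z.shape, 0.25)              # pretend feature-wise variances
sample = sample_diag_gaussian(z, var, rng)
```

Separating `r` from `u` means that a change in embedding norm affects only the radial factor, which is the intuition behind the claim that the decomposition avoids norm-induced instabilities.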
