Joint Embedding Variational Bayes

Amin Oji; Paul Fieguth

TL;DR

VJE introduces a normalized probabilistic formulation for non-contrastive self-supervised learning by positing a latent-variable model over encoder embeddings and optimizing a symmetric conditional ELBO. The likelihood is factorized into directional and radial components via a polar decomposition and is modeled with a heavy-tailed Student-$t$ distribution, with feature-wise uncertainty captured through a shared diag($\boldsymbol{\nu}$) variance that ties the posterior and likelihood. An asymmetric encoder–inference/target setup with stop-gradient enables fixed-observation conditioning, yielding non-degenerate posteriors and enabling density-based anomaly scoring. Empirically, VJE achieves competitive representation quality on ImageNet-1K and CIFAR/STL while providing coherent probabilistic semantics, demonstrated by strong one-class anomaly detection performance on CIFAR-10 and robust ablations. This work offers a principled alternative to energy-based, pointwise non-contrastive objectives by grounding representation learning in normalized probabilistic modelling and uncertainty quantification.

Abstract

We introduce Variational Joint Embedding (VJE), a framework that synthesizes joint embedding and variational inference to enable self-supervised learning of probabilistic representations in a reconstruction-free, non-contrastive setting. Compared to energy-based predictive objectives that optimize pointwise discrepancies, VJE maximizes a symmetric conditional evidence lower bound (ELBO) for a latent-variable model defined directly on encoder embeddings. We instantiate the conditional likelihood with a heavy-tailed Student-$t$ model using a polar decomposition that explicitly decouples directional and radial factors to prevent norm-induced instabilities during training. VJE employs an amortized inference network to parameterize a diagonal Gaussian variational posterior whose feature-wise variances are shared with the likelihood scale to capture anisotropic uncertainty without auxiliary projection heads. Across ImageNet-1K, CIFAR-10/100, and STL-10, VJE achieves performance comparable to standard non-contrastive baselines under linear and k-NN evaluation. We further validate these probabilistic semantics through one-class CIFAR-10 anomaly detection, where likelihood-based scoring under the proposed model outperforms comparable self-supervised baselines.
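
As a rough illustration of the pieces named in the abstract, the following Python sketch draws a reparameterized sample from a diagonal Gaussian posterior and splits a target embedding into a unit direction and a radial residual. The residual follows $\Delta r_{12}=\|\mathbf{z}_2\|-\|\mathbf{s}_1\|$ as stated in the Figure 1 caption. The function names, the plain-list vectors, and the standard-normal prior in the KL term are assumptions made for illustration, not details confirmed by the paper.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector stored as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def sample_latent(mu, sigma, rng):
    """Reparameterized draw s = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def polar_target(z2, s1):
    """Split the target embedding into a unit direction and a radial residual.

    z2 plays the role of the detached (stop-gradient) target embedding, so it
    is treated here as a fixed observation. s1 is the latent sample from the
    other view.
    """
    r2 = norm(z2)
    z2_hat = [x / r2 for x in z2]   # directional part, unit norm
    delta_r = r2 - norm(s1)         # radial residual ||z2|| - ||s1||
    return z2_hat, delta_r


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ).

    The N(0, I) prior is an assumption of this sketch.
    """
    return 0.5 * sum(
        s * s + m * m - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma)
    )


# Hypothetical two-dimensional example.
rng = random.Random(0)
mu1, sigma1 = [0.5, -0.2], [0.1, 0.3]
s1 = sample_latent(mu1, sigma1, rng)
z2_hat, delta_r = polar_target([3.0, 4.0], s1)
kl = kl_to_standard_normal(mu1, sigma1)
```

In a real training loop these quantities would feed the directional and radial likelihood terms and the KL regularizer; an autodiff framework would be needed to propagate gradients through the sample.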

Paper Structure (46 sections, 50 equations, 4 figures, 7 tables, 4 algorithms)

Introduction
Background
Non-contrastive self-supervised learning (SSL).
Variational inference and uncertainty quantification.
Variational inference in joint embedding architectures.
Model Architecture
Encoder and inference network.
Latent sampling and likelihood evaluation.
KL regularization and total objective.
Latent variable model
Likelihood distribution.
Polar decomposition of the likelihood
Directional whitening and feature-wise uncertainty
Radial reparameterization and final likelihood
Variational posterior and evidence lower bound
...and 31 more sections

Figures (4)

Figure 1: The asymmetric forward pass for one conditional direction in VJE, from view 1 to view 2. An encoder $f_\theta$ produces $\mathbf{z}_1$, and an amortized inference network $g_\phi$ maps it to a latent distribution $q_1(\mathbf{s}) = \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\sigma}_1^2)$. A sample $\mathbf{s}_1$ is drawn and the conditional likelihood of the reparameterized target observation $y_2=(\hat{\mathbf{z}}_2,\Delta r_{12})$ is evaluated under this latent code, where $\Delta r_{12}=\|\mathbf{z}_2\|-\|\mathbf{s}_1\|$. The target branch is detached (stop-gradient), enforcing fixed-observation semantics for the conditional likelihood term. The loss consists of directional ($\ell_{\mathrm{dir}}$) and radial ($\ell_{\mathrm{rad}}$) negative log-likelihoods (NLLs), jointly denoted $\mathcal{L}_{\mathrm{NLL}}$, together with a Kullback-Leibler (KL) divergence term $\mathcal{L}_{\mathrm{KL}}$.
Figure 2: One-dimensional views of the Student-$t$ loss used in VJE, plotted as functions of the residual $r$ for different degrees of freedom $\nu$ (with the Gaussian limit at $\nu=\infty$). The NLL panel illustrates how heavy tails moderate the growth of the negative log-likelihood for large residuals, while the gradient panel shows the corresponding influence functions, where gradients saturate and then decay so that outliers contribute only a bounded amount of signal. This underlines the choice of Student-$t$ likelihoods in VJE to stabilize training without ad-hoc heuristics.
Figure 3: $k$-NN accuracy over training on CIFAR-10 for SimSiam, VICReg, and VJE ($k=30$). VJE curves are shown for both encoder output $z$ and posterior mean $\mu$, demonstrating stable convergence and close alignment between the two representations throughout training.
Figure 4: CIFAR-10 one-class detection: class-averaged AUROC across the $\beta{\times}\nu$ grid using the joint-likelihood score (the paper's anomaly-score equation). Lighter regions indicate higher AUROC. The optimal regime concentrates at $\beta \approx 1.0$ and small $\nu$.
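
The behaviour described in the Figure 2 caption can be reproduced with a few lines of Python. This is a minimal sketch assuming a unit-scale scalar Student-$t$ with additive constants dropped; the function names are hypothetical, and the paper's actual likelihood may be parameterized differently.

```python
import math


def student_t_nll(r, nu):
    """Scalar Student-t negative log-likelihood, up to an additive constant.

    Assumes unit scale. nu = math.inf gives the Gaussian limit r^2 / 2.
    """
    if math.isinf(nu):
        return 0.5 * r * r
    return 0.5 * (nu + 1.0) * math.log1p(r * r / nu)


def student_t_influence(r, nu):
    """Derivative of student_t_nll with respect to the residual r."""
    if math.isinf(nu):
        return r
    return (nu + 1.0) * r / (nu + r * r)


# Tabulate loss and gradient over a few residuals and degrees of freedom.
residuals = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]
table = {
    nu: [(r, student_t_nll(r, nu), student_t_influence(r, nu)) for r in residuals]
    for nu in (1.0, 3.0, 10.0, math.inf)
}
```

For finite $\nu$ the influence function $(\nu+1)\,r/(\nu+r^2)$ peaks at $|r|=\sqrt{\nu}$ and decays towards zero beyond it, whereas in the Gaussian limit it grows linearly with $r$. This is the sense in which large residuals contribute only a bounded gradient.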