Joint Embedding Variational Bayes

Amin Oji, Paul Fieguth

TL;DR

VJE introduces a normalized probabilistic formulation for non-contrastive self-supervised learning by positing a latent-variable model over encoder embeddings and optimizing a symmetric conditional ELBO. The likelihood is factorized into directional and radial components via a polar decomposition and is modeled with a heavy-tailed Student-$t$ distribution, with feature-wise uncertainty captured through a shared diag($\boldsymbol{\nu}$) variance that ties the posterior and likelihood. An asymmetric encoder–inference/target setup with stop-gradient enables fixed-observation conditioning, yielding non-degenerate posteriors and enabling density-based anomaly scoring. Empirically, VJE achieves competitive representation quality on ImageNet-1K and CIFAR/STL while providing coherent probabilistic semantics, demonstrated by strong one-class anomaly detection performance on CIFAR-10 and robust ablations. This work offers a principled alternative to energy-based, pointwise non-contrastive objectives by grounding representation learning in normalized probabilistic modelling and uncertainty quantification.

Abstract

We introduce Variational Joint Embedding (VJE), a framework that synthesizes joint embedding and variational inference to enable self-supervised learning of probabilistic representations in a reconstruction-free, non-contrastive setting. Compared to energy-based predictive objectives that optimize pointwise discrepancies, VJE maximizes a symmetric conditional evidence lower bound (ELBO) for a latent-variable model defined directly on encoder embeddings. We instantiate the conditional likelihood with a heavy-tailed Student-$t$ model using a polar decomposition that explicitly decouples directional and radial factors to prevent norm-induced instabilities during training. VJE employs an amortized inference network to parameterize a diagonal Gaussian variational posterior whose feature-wise variances are shared with the likelihood scale to capture anisotropic uncertainty without auxiliary projection heads. Across ImageNet-1K, CIFAR-10/100, and STL-10, VJE achieves performance comparable to standard non-contrastive baselines under linear and k-NN evaluation. We further validate these probabilistic semantics through one-class CIFAR-10 anomaly detection, where likelihood-based scoring under the proposed model outperforms comparable self-supervised baselines.
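
The polar decomposition of the target observation can be made concrete with a small sketch. It follows the quantities named in the Figure 1 caption: a reparameterized sample $\mathbf{s}_1$ from the diagonal Gaussian posterior, the unit direction $\hat{\mathbf{z}}_2$, and the radial residual $\Delta r_{12}=\|\mathbf{z}_2\|-\|\mathbf{s}_1\|$. The function names (`sample_latent`, `polar_target`), the toy vectors, and the small `eps` guard are assumptions made for illustration and are not taken from the paper. Plain Python has no autodiff, so the stop-gradient on the target branch appears only as a comment.

```python
import math
import random

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def sample_latent(mu, sigma, rng):
    """Reparameterized draw s = mu + sigma * eps from a diagonal Gaussian."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def polar_target(z_target, s_sample, eps=1e-12):
    """Split the target into (unit direction, radial residual).

    The residual is ||z_target|| - ||s_sample||, as in the Figure 1 caption.
    In an autodiff framework z_target would be detached (stop-gradient)
    before this call; here it is simply a constant list.
    """
    r = norm(z_target)
    direction = [x / max(r, eps) for x in z_target]  # eps guard is an assumption
    delta_r = r - norm(s_sample)
    return direction, delta_r

rng = random.Random(0)
mu1, sigma1 = [0.6, 0.8, 0.0], [0.1, 0.1, 0.1]  # hypothetical posterior parameters
z2 = [3.0, 4.0, 0.0]                            # hypothetical target embedding

s1 = sample_latent(mu1, sigma1, rng)
z2_hat, delta_r12 = polar_target(z2, s1)
print("direction:", z2_hat, "radial residual:", delta_r12)
```

Rescaling `z2` leaves `z2_hat` unchanged and only shifts `delta_r12`. In this sketch, that is the sense in which the directional and radial factors are decoupled.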

Paper Structure (46 sections, 50 equations, 4 figures, 7 tables, 4 algorithms)

Figures (4)

  • Figure 1: The asymmetric forward pass for one conditional direction in VJE, from view 1 to view 2. An encoder $f_\theta$ produces $\mathbf{z}_1$, and an amortized inference network $g_\phi$ maps it to a latent distribution $q_1(\mathbf{s}) = \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\sigma}_1^2)$. A sample $\mathbf{s}_1$ is drawn and the conditional likelihood of the reparameterized target observation $y_2=(\hat{\mathbf{z}}_2,\Delta r_{12})$ is evaluated under this latent code, where $\Delta r_{12}=\|\mathbf{z}_2\|-\|\mathbf{s}_1\|$. The target branch is detached (stop-gradient), enforcing fixed-observation semantics for the conditional likelihood term. The loss consists of directional ($\ell_{\mathrm{dir}}$) and radial ($\ell_{\mathrm{rad}}$) negative log-likelihoods (NLLs), jointly denoted $\mathcal{L}_{\mathrm{NLL}}$, together with a Kullback-Leibler (KL) divergence term $\mathcal{L}_{\mathrm{KL}}$.
  • Figure 2: One-dimensional views of the Student-$t$ loss used in VJE, plotted as functions of the residual $r$ for different degrees of freedom $\nu$ (with the Gaussian limit at $\nu=\infty$). The NLL panel illustrates how heavy tails moderate the growth of the negative log-likelihood for large residuals, while the gradient panel shows the corresponding influence functions, where gradients saturate and then decay so that outliers contribute only a bounded amount of signal. This motivates the choice of Student-$t$ likelihoods in VJE to stabilize training without ad-hoc heuristics.
  • Figure 3: $k$-NN accuracy over training on CIFAR-10 for SimSiam, VICReg, and VJE ($k=30$). VJE curves are shown for both encoder output $z$ and posterior mean $\mu$, demonstrating stable convergence and close alignment between the two representations throughout training.
  • Figure 4: CIFAR-10 one-class detection: class-averaged AUROC across the $\beta{\times}\nu$ grid using the paper's joint-likelihood anomaly score. Lighter regions indicate higher AUROC. The best-performing regime is concentrated around $\beta \approx 1.0$ and small $\nu$.
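
The behaviour described in the Figure 2 caption can be reproduced with the textbook scalar Student-$t$ density. The sketch below assumes a zero-mean Student-$t$ with scale $\sigma$ on a scalar residual $r$; the exact parameterization used in VJE may differ, and the function names are illustrative. Under that assumption the influence function is $\partial\,\mathrm{NLL}/\partial r = (\nu+1)\,r/(\nu\sigma^2+r^2)$, which peaks at $|r|=\sqrt{\nu}\,\sigma$ and then decays, whereas the Gaussian gradient $r/\sigma^2$ grows without bound.

```python
import math

def student_t_nll(r, nu, sigma=1.0):
    """NLL of residual r under a zero-mean scalar Student-t with scale sigma."""
    log_norm = (math.lgamma(nu / 2.0) - math.lgamma((nu + 1.0) / 2.0)
                + 0.5 * math.log(nu * math.pi) + math.log(sigma))
    return log_norm + 0.5 * (nu + 1.0) * math.log1p(r * r / (nu * sigma * sigma))

def student_t_influence(r, nu, sigma=1.0):
    """Derivative of the Student-t NLL with respect to r (influence function)."""
    return (nu + 1.0) * r / (nu * sigma * sigma + r * r)

def gaussian_nll(r, sigma=1.0):
    """NLL of residual r under a zero-mean Gaussian (the nu -> infinity limit)."""
    return 0.5 * math.log(2.0 * math.pi) + math.log(sigma) + 0.5 * (r / sigma) ** 2

def gaussian_influence(r, sigma=1.0):
    """Derivative of the Gaussian NLL with respect to r; linear in r."""
    return r / (sigma * sigma)

nu = 3.0                 # illustrative degrees of freedom
peak_r = math.sqrt(nu)   # residual at which the Student-t influence is largest (sigma = 1)
for r in (0.5, peak_r, 10.0, 100.0):
    print(f"r={r:8.3f}  student-t grad={student_t_influence(r, nu):.4f}  "
          f"gaussian grad={gaussian_influence(r):.4f}")
```

Comparing the two gradient columns shows the bounded-influence property: beyond the peak, larger residuals yield smaller Student-$t$ gradients.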