Radial-VCReg: More Informative Representation Learning Through Radial Gaussianization

Yilun Kuang, Yash Dagade, Deep Chakraborty, Erik Learned-Miller, Randall Balestriero, Tim G. J. Rudner, Yann LeCun

TL;DR

Radial-VCReg tackles the challenge of maximizing information in self-supervised learning by moving beyond VCReg’s linear dependency regularization. It introduces a radial Gaussianization loss that aligns the feature radius with a Chi distribution, and integrates it into Radial-VICReg to widen the class of distributions that can be Gaussianized. The authors provide theoretical guarantees showing Radial-VCReg strictly expands Gaussianizable distributions compared to VCReg, and they demonstrate consistent empirical gains on synthetic data and real-world image datasets like CIFAR-100, ImageNet-10, and CelebA. This approach offers a principled way to reduce higher-order dependencies and produce more diverse, informative representations with practical implications for downstream tasks.

Abstract

Self-supervised learning aims to learn maximally informative representations, but explicit information maximization is hindered by the curse of dimensionality. Existing methods like VCReg address this by regularizing first- and second-order feature statistics, which cannot fully achieve maximum entropy. We propose Radial-VCReg, which augments VCReg with a radial Gaussianization loss that aligns feature norms with the Chi distribution, a defining property of high-dimensional Gaussians. We prove that Radial-VCReg transforms a broader class of distributions towards normality than VCReg does, and we show on synthetic and real-world datasets that it consistently improves performance by reducing higher-order dependencies and promoting more diverse and informative representations.
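The link between Gaussians and the Chi distribution can be checked numerically. The sketch below is illustrative only and is not the authors' loss: it draws standard normal vectors, compares their mean norm with the closed-form Chi mean, and locates the minimizer of the Chi negative log-density (up to a constant), which sits at $\sqrt{d-1}$. The function names `chi_mean` and `chi_nll` and the choice $d = 64$ are assumptions made for this example.

```python
import math
import numpy as np

def chi_mean(d):
    # Closed-form mean of the Chi distribution with d degrees of freedom.
    return math.sqrt(2.0) * math.exp(math.lgamma((d + 1) / 2) - math.lgamma(d / 2))

def chi_nll(r, d):
    # Negative log-density of Chi(d) at radius r, dropping the additive constant.
    return 0.5 * r**2 - (d - 1) * np.log(r)

rng = np.random.default_rng(0)
d = 64  # illustrative feature dimension (assumption)

# Norms of standard normal vectors in R^d follow Chi(d).
z = rng.standard_normal((20000, d))
radii = np.linalg.norm(z, axis=1)
empirical_mean = radii.mean()

# The Chi negative log-density is minimized at the mode sqrt(d - 1).
grid = np.linspace(0.5, 2.0 * math.sqrt(d), 4000)
mode_estimate = grid[np.argmin(chi_nll(grid, d))]

print(empirical_mean, chi_mean(d))
print(mode_estimate, math.sqrt(d - 1))
```

In high dimensions the norms concentrate in a thin shell near $\sqrt{d}$, which is why matching the radius distribution is a meaningful constraint on top of matching the mean and covariance.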

Paper Structure (28 sections, 2 theorems, 17 equations, 5 figures, 4 tables)

This paper contains 28 sections, 2 theorems, 17 equations, 5 figures, 4 tables.

Key Result

Proposition 1

Let $\mathbf{X}$ be a random vector in $\mathbb{R}^d$ with distribution $P_{\mathbf{X}}$. Define the VCReg map and the Radial-VCReg map as [display equations omitted], where $\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}]$, $\boldsymbol{\Sigma} = \mathrm{Cov}[\mathbf{X}]$, $F_{\|\boldsymbol{\Sigma}^{-1/2} (\mathbf{x} - \boldsymbol{\mu})\|_2}$ is the CDF of the radial component of the whitened random vector, and $F_\chi^{-1}$ is the i[...]
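The symbols defined in the proposition suggest one plausible reading of the two maps: whitening with $\boldsymbol{\Sigma}^{-1/2}(\mathbf{x} - \boldsymbol{\mu})$, and whitening followed by a rescaling of each radius so that the radii follow a Chi distribution. The sketch below is an empirical illustration under that assumption, not the paper's definition. The names `whiten` and `radial_gaussianize` are hypothetical, the radial CDF is replaced by sample ranks, and the inverse Chi CDF is approximated by Monte Carlo quantiles of Gaussian norms.

```python
import numpy as np

def whiten(x, eps=1e-8):
    # Empirical analogue of Sigma^{-1/2} (x - mu), using the symmetric inverse square root.
    mu = x.mean(axis=0)
    cov = np.cov(x - mu, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return (x - mu) @ inv_sqrt

def radial_gaussianize(x, rng, n_ref=200_000):
    # Whiten, then remap each radius through (rank-based CDF) -> (approximate Chi quantile).
    w = whiten(x)
    n, d = w.shape
    r = np.linalg.norm(w, axis=1)
    u = (np.argsort(np.argsort(r)) + 0.5) / n  # empirical CDF values in (0, 1)
    ref = np.linalg.norm(rng.standard_normal((n_ref, d)), axis=1)  # Chi(d) samples
    target_r = np.quantile(ref, u)  # Monte Carlo stand-in for the inverse Chi CDF
    return w * (target_r / np.maximum(r, 1e-12))[:, None]

def kurtosis(v):
    # Fourth standardized moment; equals 3 for a Gaussian.
    v = v - v.mean()
    return (v**4).mean() / (v**2).mean() ** 2

rng = np.random.default_rng(0)
n = 20000
# Elliptically symmetric but non-Gaussian data: a uniform disk, stretched and shifted.
theta = rng.uniform(0.0, 2.0 * np.pi, n)
rad = np.sqrt(rng.uniform(0.0, 1.0, n))
disk = np.stack([rad * np.cos(theta), rad * np.sin(theta)], axis=1)
x = disk @ np.array([[2.0, 0.5], [0.0, 1.0]]).T + np.array([1.0, -3.0])

k_white = kurtosis(whiten(x)[:, 0])
k_radial = kurtosis(radial_gaussianize(x, rng)[:, 0])
print(k_white, k_radial)
```

On this toy input, whitening alone leaves the marginal kurtosis near 2 (the value for a uniform disk), while the radial step brings it close to the Gaussian value of 3. This only illustrates the elliptically symmetric case; it says nothing about distributions such as the X-distribution in Figure 1.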

Figures (5)

  • Figure 1: The Radial-VCReg objective more effectively pushes samples from a non-elliptically symmetric $\mathrm{X}$-distribution towards the standard normal distribution in 2D compared to the VCReg objective. (a) The $\mathrm{X}$-distribution has an identity covariance matrix, but it is not elliptically symmetric. (b) Samples from the $\mathrm{X}$-distribution are optimized with the Radial-VCReg loss, yielding a spherical structure. (c) As the ratio $\alpha$ of samples from the $\mathrm{X}$-distribution increases, samples optimized with the Radial-VCReg loss are closer to the standard normal than those optimized with VCReg. The VCReg objective is also unable to move the samples away from their starting distributions.
  • Figure 2: Radial-VICReg enforces a chi-distributed radius after optimization, and classification accuracy correlates with the quality of the chi-distribution matching. (a) The feature norm distribution at random initialization, with Wasserstein distance $W_1$ to the Chi distribution $\chi$ equal to $17.15$. (b) The feature norm distribution under the VICReg loss is far from the Chi distribution. (c) Representations learned with Radial-VICReg closely match the Chi distribution density function. (d) Across hyperparameter sweeps, validation accuracy increases as the radius distribution better matches the $\chi$-distribution, as measured by a lower Wasserstein distance.
  • Figure 3: There exist distributions that minimize the Radial-VCReg loss but are not Gaussian. (a) The sunshine distribution is built by first generating points from a 2D isotropic Gaussian distribution. These points are then converted to polar coordinates and sorted into a specified number of pie slices. Finally, every even-numbered slice is rotated clockwise, creating a distinctive pattern of segmented, rotated clusters. (b) As the weighting $\beta_2$ for the entropy term in the radial Gaussianization loss increases, samples are pushed towards the circle of radius $\sqrt{d-1}$. In $2$ dimensions, this radius is just $1$. (c) For both the X distribution and the sunshine distribution, we observe a correlation between the E2MC loss and the radial Gaussianization loss. As both losses decrease, the optimized samples also move closer to a standard normal, as measured by the Wasserstein distance.
  • Figure 4: Radial Gaussianization aligns radii distributions with the $\chi$ distribution. Comparison of (a) direct Wasserstein-1 optimization, (b) Radial-VICReg optimization, and (c) no optimization. Both Wasserstein-1 and Radial-VICReg push the empirical radii distribution closer to the target $\chi$ distribution, with Radial-VICReg achieving a substantial improvement over the unoptimized baseline.
  • Figure 5: The optimal performance of Radial-VICReg can be obtained with $\beta_1\neq\beta_2$, even though $\beta_1=\beta_2$ gives a theoretically consistent estimator of the underlying KL divergence. We observe that $\beta_1>\beta_2$ is sometimes better for downstream task performance.
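Several captions above report a Wasserstein-1 distance between the feature-norm distribution and the Chi distribution. A minimal sketch of how such a number could be estimated is shown here, assuming equal-size samples so that $W_1$ in one dimension reduces to the mean absolute difference of sorted values. The helper `w1_radii_to_chi` is hypothetical, and the Chi reference is drawn as norms of Gaussian vectors; the paper's exact evaluation procedure is not given in this extract.

```python
import numpy as np

def w1_radii_to_chi(features, rng):
    # Empirical 1D Wasserstein-1 distance between feature norms and Chi(d) samples.
    n, d = features.shape
    radii = np.sort(np.linalg.norm(features, axis=1))
    # Reference Chi(d) samples: norms of standard normal vectors of the same dimension.
    ref = np.sort(np.linalg.norm(rng.standard_normal((n, d)), axis=1))
    # With equal sample sizes, W1 is the mean absolute gap between order statistics.
    return float(np.mean(np.abs(radii - ref)))

rng = np.random.default_rng(0)
gaussian_feats = rng.standard_normal((5000, 32))  # radii already Chi-distributed
scaled_feats = 3.0 * gaussian_feats               # radii inflated by a factor of 3

w1_gaussian = w1_radii_to_chi(gaussian_feats, rng)
w1_scaled = w1_radii_to_chi(scaled_feats, rng)
print(w1_gaussian, w1_scaled)
```

Gaussian features give a distance near zero, while the rescaled features give a large one, which is the kind of gap the histograms in Figures 2 and 4 visualize.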

Theorems & Definitions (4)

  • Proposition 1
  • Lemma 2
  • proof
  • proof