Information Flow in Self-Supervised Learning

Zhiquan Tan, Jingqin Yang, Weiran Huang, Yang Yuan, Yifan Zhang

TL;DR

Using a matrix information-theoretic lens, the paper analyzes SSL losses across contrastive, decorrelation-based, and masked image modeling methods. It proves that the spectral contrastive learning and Barlow Twins losses implicitly optimize matrix mutual information $I_2$ and matrix joint entropy $H_2$. For single-branch MAE-type methods, both quantities reduce to the matrix entropy, which motivates a matrix-entropy regularizer and yields the Matrix Variational MAE (M-MAE), a method that subsumes U-MAE as a special case. The authors introduce the objective $\mathcal{L}_{\text{M-MAE}} = \mathcal{L}_{\text{MAE}} - \lambda \cdot \text{TCR}_{\mu}(\mathbf{Z})$ and provide theory linking M-MAE to improved representation quality via increased entropy and effective rank. Empirically, on ImageNet, M-MAE improves linear probing with ViT-Base by 3.9% and fine-tuning with ViT-Large by 1%. Overall, the work offers both a unifying theoretical perspective and a concrete algorithmic advance for self-supervised visual representation learning.

Abstract

In this paper, we conduct a comprehensive analysis of two dual-branch (Siamese architecture) self-supervised learning approaches, namely Barlow Twins and spectral contrastive learning, through the lens of matrix mutual information. We prove that the loss functions of these methods implicitly optimize both matrix mutual information and matrix joint entropy. This insight prompts us to further explore the category of single-branch algorithms, specifically MAE and U-MAE, for which mutual information and joint entropy become the entropy. Building on this intuition, we introduce the Matrix Variational Masked Auto-Encoder (M-MAE), a novel method that leverages the matrix-based estimation of entropy as a regularizer and subsumes U-MAE as a special case. The empirical evaluations underscore the effectiveness of M-MAE compared with the state-of-the-art methods, including a 3.9% improvement in linear probing ViT-Base, and a 1% improvement in fine-tuning ViT-Large, both on ImageNet.
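
The M-MAE objective stated in the TL;DR, $\mathcal{L}_{\text{MAE}} - \lambda \cdot \text{TCR}_{\mu}(\mathbf{Z})$, can be illustrated with a minimal NumPy sketch. This page does not define $\text{TCR}_{\mu}$, so the sketch assumes the total coding rate form $\frac{1}{2}\log\det(\mathbf{I}_d + \frac{d}{\mu}\mathbf{Z}\mathbf{Z}^\top)$ familiar from the coding-rate literature; the exact scaling in the paper may differ. The function names, the shape convention for `z`, and the placeholder reconstruction loss are all hypothetical.

```python
import numpy as np

def total_coding_rate(z, mu):
    """Assumed form: 0.5 * logdet(I_d + (d / mu) * Z Z^T).

    z has shape (d, n): n feature vectors of dimension d stored as columns.
    """
    d, _ = z.shape
    mat = np.eye(d) + (d / mu) * (z @ z.T)
    # slogdet is numerically safer than log(det(...)) for larger d
    _, logdet = np.linalg.slogdet(mat)
    return 0.5 * logdet

def m_mae_loss(recon_loss, z, lam, mu):
    """Reconstruction loss minus the weighted entropy-style regularizer."""
    return recon_loss - lam * total_coding_rate(z, mu)

# Toy comparison: spread-out features vs. fully collapsed features.
spread = np.eye(4)                          # four orthonormal feature columns
collapsed = np.tile(np.eye(4)[:, :1], 4)    # the same column repeated four times

recon = 1.0  # placeholder value standing in for the MAE reconstruction loss
loss_spread = m_mae_loss(recon, spread, lam=0.1, mu=1.0)
loss_collapsed = m_mae_loss(recon, collapsed, lam=0.1, mu=1.0)
print(loss_spread, loss_collapsed)
```

Under these assumptions the regularizer is larger for the spread-out features, so the collapsed representation is penalized with a higher total loss, which is the qualitative behavior the entropy regularizer is meant to encourage.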

Paper Structure

This paper contains 22 sections, 31 theorems, 38 equations, 5 figures, 3 tables.

Key Result

Proposition 4.1

$\operatorname{I}_2(\mathbf{K}_1; \mathbf{K}_2) = 2\log d - \log \frac{\| \mathbf{K}_1 \|_F^2 \, \| \mathbf{K}_2 \|_F^2}{\| \mathbf{K}_1 \odot \mathbf{K}_2 \|_F^2}$, where $d$ is the size of the matrix $\mathbf{K}_1$.
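
As a numerical sanity check, the closed form in Proposition 4.1 can be compared against the definition-based quantity. The sketch below assumes the usual matrix-based Rényi definitions, which this page lists only by name: $\operatorname{H}_2(\mathbf{K}) = -\log \operatorname{tr}\big((\mathbf{K}/d)^2\big)$ for a positive semi-definite $\mathbf{K}$ with unit diagonal, joint entropy $\operatorname{H}_2(\mathbf{K}_1 \odot \mathbf{K}_2)$, and $\operatorname{I}_2 = \operatorname{H}_2(\mathbf{K}_1) + \operatorname{H}_2(\mathbf{K}_2) - \operatorname{H}_2(\mathbf{K}_1 \odot \mathbf{K}_2)$. All function names are hypothetical.

```python
import numpy as np

def gram_unit_diag(z):
    """Gram matrix of L2-normalized rows: PSD with ones on the diagonal."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return z @ z.T

def renyi2_entropy(k):
    """Assumed definition: H_2(K) = -log tr((K/d)^2)."""
    d = k.shape[0]
    scaled = k / d
    return -np.log(np.trace(scaled @ scaled))

def mutual_information_def(k1, k2):
    """I_2 = H_2(K1) + H_2(K2) - H_2(K1 ⊙ K2), with ⊙ the Hadamard product."""
    return renyi2_entropy(k1) + renyi2_entropy(k2) - renyi2_entropy(k1 * k2)

def mutual_information_closed_form(k1, k2):
    """Closed form from Proposition 4.1, using squared Frobenius norms."""
    d = k1.shape[0]
    fro2 = lambda m: np.sum(m * m)
    return 2 * np.log(d) - np.log(fro2(k1) * fro2(k2) / fro2(k1 * k2))

rng = np.random.default_rng(0)
k1 = gram_unit_diag(rng.normal(size=(8, 16)))  # 8 samples, view 1
k2 = gram_unit_diag(rng.normal(size=(8, 16)))  # 8 samples, view 2

i2_def = mutual_information_def(k1, k2)
i2_closed = mutual_information_closed_form(k1, k2)
print(i2_def, i2_closed, np.isclose(i2_def, i2_closed))
```

The agreement follows because $\operatorname{tr}(\mathbf{K}^2) = \|\mathbf{K}\|_F^2$ for symmetric $\mathbf{K}$, so each entropy equals $2\log d - \log \|\mathbf{K}\|_F^2$ and two of the three $2\log d$ terms cancel.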

Figures (5)

  • Figure 1: Visualization of matrix-based mutual information on CIFAR-10 for Barlow Twins, BYOL, and SimCLR.
  • Figure 2: Visualization of matrix-based joint entropy on CIFAR-10 for Barlow Twins, BYOL, and SimCLR.
  • Figure 3: Tendency of matrix information quantities under different temperatures. The experiments are conducted on CIFAR-10 using SimCLR.
  • Figure 4: The effective rank during pre-training.
  • Figure 5: Visualization of matrix JS divergence and eigenspace JS divergence on CIFAR-10 for Barlow Twins, BYOL, and SimCLR.

Theorems & Definitions (56)

  • Definition 3.1: Matrix-based $\alpha$-order (Rényi) entropy (Skean et al., 2023)
  • Definition 3.2: Matrix-based mutual information (Skean et al., 2023)
  • Definition 3.3: Matrix-based joint entropy (Skean et al., 2023)
  • Proposition 4.1
  • Lemma 4.2
  • Theorem 4.3
  • Proof
  • Theorem 4.4
  • Proposition 4.5
  • Corollary 4.6
  • ...and 46 more