Matrix Information Theory for Self-Supervised Learning

Yifan Zhang; Zhiquan Tan; Jingqin Yang; Weiran Huang; Yang Yuan

TL;DR

This work develops Matrix-SSL, a matrix information theory-based framework that unifies contrastive and non-contrastive self-supervised learning by incorporating matrix uniformity and matrix alignment losses. It introduces matrix-based information measures (ME, MKL, MCE) and proves relationships that connect these to existing MEC and TCR formulations, while highlighting effective rank as a diagnostic of dimensionality and information preservation. Empirically, Matrix-SSL improves ImageNet linear evaluation and MS-COCO transfer tasks with fewer pre-training epochs, and extends the approach to large language models, achieving notable gains on GSM8K and MATH benchmarks. The approach offers a principled, scalable pathway to leverage covariance- and cross-covariance structures in SSL, with potential broad impact on vision and NLP representations.

Abstract

The maximum entropy encoding framework provides a unified perspective for many non-contrastive learning methods like SimSiam, Barlow Twins, and MEC. Inspired by this framework, we introduce Matrix-SSL, a novel approach that leverages matrix information theory to interpret the maximum entropy encoding loss as a matrix uniformity loss. Furthermore, Matrix-SSL enhances the maximum entropy encoding method by incorporating a matrix alignment loss that directly aligns the covariance matrices of the different branches. Experimental results show that Matrix-SSL outperforms state-of-the-art methods on the ImageNet dataset under linear evaluation settings and on MS-COCO for transfer learning tasks. Specifically, on MS-COCO transfer learning tasks, our method outperforms previous SOTA methods such as MoCo v2 and BYOL by up to 3.3% with only 400 epochs of pre-training, compared to 800 epochs. We also bring representation learning into the language modeling regime by fine-tuning a 7B model with the matrix cross-entropy loss, which improves on the standard cross-entropy loss by a margin of 3.1% on the GSM8K dataset. Code is available at https://github.com/yifanzhang-pro/Matrix-SSL.

Paper Structure (36 sections, 12 theorems, 71 equations, 4 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Background
Contrastive Learning
Non-contrastive Learning
Matrix Information-Theoretic Quantities
Illustrative example.
Effective Rank
On TCR and Matrix KL Divergence
Proof sketch, see the full proof in Appendix \ref{proof:mce-tcr-md}.
Matrix Uniformity and Alignment
Matrix-SSL: Uniformity and Alignment
Effective Rank and Dimensional Collapse
Experiments
Experimental Setup
...and 21 more sections

Key Result

Lemma 3.4

For any non-zero matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{A}\mathbf{A}^\top$ is positive semi-definite.
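
The lemma follows from the identity $\mathbf{x}^\top \mathbf{A}\mathbf{A}^\top \mathbf{x} = \|\mathbf{A}^\top \mathbf{x}\|^2 \ge 0$ for every vector $\mathbf{x}$. The snippet below is a small numerical sanity check of that identity, not code from the paper; the helper name `is_psd`, the tolerance, and the random test matrix are illustrative assumptions.

```python
import numpy as np


def is_psd(M, tol=1e-10):
    """Return True if the symmetric matrix M has no eigenvalue below -tol.

    Hypothetical helper for illustration; tol absorbs floating-point error.
    """
    eigvals = np.linalg.eigvalsh(M)  # eigenvalues of a symmetric matrix, ascending
    return bool(eigvals.min() >= -tol)


rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))   # arbitrary non-zero m x n matrix (m=3, n=5)
gram = A @ A.T                    # the m x m matrix A A^T from the lemma

# The quadratic form x^T (A A^T) x equals ||A^T x||^2, hence is non-negative.
x = rng.standard_normal(3)
quad = x @ gram @ x
norm_sq = np.linalg.norm(A.T @ x) ** 2
```

Comparing `quad` with `norm_sq` for a few random `x`, and calling `is_psd(gram)`, is a quick way to confirm the property on concrete inputs.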

Figures (4)

Figure 1: Illustration of the Matrix-SSL architecture. The diagram begins with the image input layer, followed by data augmentations and feature extraction, leading to the formation of covariance matrices ($\mathbf{Z}_1\mathbf{Z}_1^\top$ and $\mathbf{Z}_2\mathbf{Z}_2^\top$).
Figure 2: Visualization of feature representation for images in 5 different classes from the CIFAR-100 dataset via t-SNE of various self-supervised learning methods. We find that SimCLR has larger inter-class variability than the others, as the clusters seem more separable. For illustration, we also introduce a collapsed representation via SimSiam without the stop-gradient operation.
Figure 3: Intra-class effective rank and inter-class effective rank. Intra-class effective rank continues to grow for BYOL and Barlow Twins, but not for SimCLR.
Figure 4: Visualization of feature representation for images in 10 different classes from the CIFAR-100 dataset via t-SNE of various self-supervised learning methods. We find that in many categories, it is difficult to distinguish between the two non-contrastive methods (BYOL, Barlow Twins) and the contrastive method (SimCLR) by t-SNE.
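
The effective rank reported in Figure 3 can be computed from the singular values of a feature matrix. The sketch below assumes the Roy and Vetterli definition referenced as Definition 3.7: normalize the singular values into a probability distribution and exponentiate its Shannon entropy. The function name `effective_rank` and the cutoff `eps` are illustrative assumptions, and the exact matrices the paper feeds into this measure are not specified here.

```python
import numpy as np


def effective_rank(A, eps=1e-12):
    """Effective rank of a non-zero matrix A (assumed Roy & Vetterli form).

    erank(A) = exp(-sum_k p_k log p_k), with p_k = sigma_k / sum_j sigma_j.
    eps drops numerically zero entries so that 0 * log(0) is treated as 0.
    """
    s = np.linalg.svd(A, compute_uv=False)  # singular values, non-negative
    p = s / s.sum()                         # normalize to a distribution
    p = p[p > eps]                          # discard (near-)zero mass
    entropy = -(p * np.log(p)).sum()        # Shannon entropy in nats
    return float(np.exp(entropy))
```

Under this definition an identity matrix of size n has effective rank n, while a rank-one matrix has effective rank 1, so lower values indicate features concentrated in fewer directions.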

Theorems & Definitions (31)

Definition 3.1: Matrix entropy for positive semi-definite matrices
Definition 3.2: Matrix KL divergence for positive semi-definite matrices [amari2014information]
Definition 3.3: Matrix Cross-Entropy (MCE) for positive semi-definite matrices
Lemma 3.4
Proposition 3.5: Minimization property of matrix KL divergence
Proposition 3.6: Minimization property of matrix cross-entropy
Definition 3.7: Effective rank [roy2007effective]
Theorem 4.1: Main Theorem
proof
Theorem 4.2: Minimization property of TCR loss
...and 21 more
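
The formulas behind Definitions 3.1 to 3.3 are not reproduced in this summary. The sketch below assumes the commonly used forms $\mathrm{ME}(\mathbf{A}) = -\operatorname{tr}(\mathbf{A}\log\mathbf{A}) + \operatorname{tr}(\mathbf{A})$, $\mathrm{MKL}(\mathbf{P}\,\|\,\mathbf{Q}) = \operatorname{tr}(\mathbf{P}\log\mathbf{P} - \mathbf{P}\log\mathbf{Q} - \mathbf{P} + \mathbf{Q})$, and $\mathrm{MCE}(\mathbf{P},\mathbf{Q}) = \operatorname{tr}(-\mathbf{P}\log\mathbf{Q} + \mathbf{Q})$, so that $\mathrm{MCE} = \mathrm{MKL} + \mathrm{ME}(\mathbf{P})$; these should be checked against the paper before use. All function names are hypothetical, and the inputs to `matrix_kl` and `matrix_cross_entropy` are assumed symmetric positive definite so that the matrix logarithm exists.

```python
import numpy as np


def _logm_pd(M):
    """Matrix logarithm of a symmetric positive definite matrix via eigh."""
    w, V = np.linalg.eigh(M)
    return (V * np.log(w)) @ V.T  # V diag(log w) V^T


def matrix_entropy(A, eps=1e-12):
    """Assumed form: ME(A) = -tr(A log A) + tr(A), with 0 log 0 := 0."""
    w = np.linalg.eigvalsh(A)
    w = w[w > eps]
    return float(-(w * np.log(w)).sum() + np.trace(A))


def matrix_kl(P, Q):
    """Assumed form: MKL(P || Q) = tr(P log P - P log Q - P + Q)."""
    return float(np.trace(P @ _logm_pd(P) - P @ _logm_pd(Q) - P + Q))


def matrix_cross_entropy(P, Q):
    """Assumed form: MCE(P, Q) = tr(-P log Q + Q)."""
    return float(np.trace(-P @ _logm_pd(Q) + Q))


def random_pd(d, seed):
    """Illustrative helper: a random symmetric positive definite d x d matrix."""
    B = np.random.default_rng(seed).standard_normal((d, d))
    return B @ B.T / d + 0.1 * np.eye(d)
```

With these assumed forms, `matrix_kl(P, P)` vanishes and `matrix_cross_entropy(P, Q)` is smallest at `Q = P`, which is the behavior stated by the minimization properties in Propositions 3.5 and 3.6.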
