Towards an Improved Understanding and Utilization of Maximum Manifold Capacity Representations

Rylan Schaeffer; Victor Lecomte; Dhruv Bhandarkar Pai; Andres Carranza; Berivan Isik; Alyssa Unell; Mikail Khona; Thomas Yerxa; Yann LeCun; SueYeon Chung; Andrey Gromov; Ravid Shwartz-Ziv; Sanmi Koyejo

TL;DR

This paper provides a theoretical and empirical deep-dive into Maximum Manifold Capacity Representations (MMCR), a multi-view self-supervised learning (MVSSL) method whose loss is the negative nuclear norm of the manifold centers. Using high-dimensional probability, it shows that MMCR promotes perfect reconstruction and perfect uniformity of embeddings on the hypersphere, which in turn maximizes a variational lower bound on the mutual information between views. It predicts and confirms double-descent-like behavior in the pretraining loss with respect to two atypical hyperparameters, the number of data points $P$ and the embedding dimension $D$, and identifies compute scaling laws that predict the pretraining loss as compute grows. The work further demonstrates MMCR’s viability in multimodal image-text settings and situates MMCR within the duality between sample-contrastive and dimension-contrastive SSL, offering practical insights for improving MVSSL methods and their scope.

Abstract

Maximum Manifold Capacity Representations (MMCR) is a recent multi-view self-supervised learning (MVSSL) method that matches or surpasses other leading MVSSL methods. MMCR is intriguing because it does not fit neatly into any of the commonplace MVSSL lineages, instead originating from a statistical mechanical perspective on the linear separability of data manifolds. In this paper, we seek to improve our understanding and our utilization of MMCR. To better understand MMCR, we leverage tools from high-dimensional probability to demonstrate that MMCR incentivizes alignment and uniformity of learned embeddings. We then leverage tools from information theory to show that such embeddings maximize a well-known lower bound on mutual information between views, thereby connecting the geometric perspective of MMCR to the information-theoretic perspective commonly discussed in MVSSL. To better utilize MMCR, we mathematically predict and experimentally confirm non-monotonic changes in the pretraining loss akin to double descent but with respect to atypical hyperparameters. We also discover compute scaling laws that enable predicting the pretraining loss as a function of gradient steps, batch size, embedding dimension, and number of views. We then show that MMCR, originally applied to image data, is performant on multimodal image-text data. By more deeply understanding the theoretical and empirical behavior of MMCR, our work reveals insights on improving MVSSL methods.

Paper Structure (14 sections, 6 theorems, 12 equations, 7 figures)

This paper contains 14 sections, 6 theorems, 12 equations, 7 figures.

Introduction
Preliminaries
Multi-View Self-Supervised Learning (MVSSL)
Maximum Manifold Capacity Representations
A High-Dimensional Probability Analysis of Maximum Manifold Capacity Representations
An Information Theoretic Understanding of Maximum Manifold Capacity Representations
Double Descent in Maximum Manifold Capacity Representations Pretraining Loss
Compute Scaling Laws in Maximum Manifold Capacity Representations
Multi-Modality in Maximum Manifold Capacity Representations
Relationship of MMCR to the Duality of Sample-Contrastive and Dimension-Contrastive Self-Supervised Learning
Discussion
Proof of Theorem \ref{theorem:recon_plus_uniform_min_mmcr}
Python Code for Perfect Reconstruction and Perfect Uniformity Embeddings
Multimodal Maximum Manifold Capacity Representations

Key Result

Proposition 3.3

Suppose that $\forall p \in [P], {\bm{c}}_p^T {\bm{c}}_p \leq 1$. Then, $0 \leq ||C||_* \leq \sqrt{P \, \min(P, D) }.$
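
The bound in Proposition 3.3 can be checked numerically. The sketch below is illustrative rather than the paper's code: it assumes $C$ is the $P \times D$ matrix whose rows are the centers ${\bm{c}}_p$, and the helper names `nuclear_norm`, `nuclear_norm_upper_bound`, and `random_centers` are hypothetical.

```python
import numpy as np


def nuclear_norm(C):
    """Nuclear norm of C: the sum of its singular values."""
    return float(np.linalg.svd(C, compute_uv=False).sum())


def nuclear_norm_upper_bound(P, D):
    """Upper bound sqrt(P * min(P, D)) stated in Proposition 3.3."""
    return float(np.sqrt(P * min(P, D)))


def random_centers(P, D, rng):
    """Assumed test data: P rows in R^D, each with squared norm <= 1."""
    C = rng.standard_normal((P, D))
    C /= np.linalg.norm(C, axis=1, keepdims=True)  # rows on the unit sphere
    C *= rng.uniform(0.0, 1.0, size=(P, 1))        # shrink rows into the unit ball
    return C


rng = np.random.default_rng(0)

# The bound should hold below, at, and above the P = D threshold.
for P, D in [(8, 64), (64, 64), (256, 16)]:
    C = random_centers(P, D, rng)
    value = nuclear_norm(C)
    bound = nuclear_norm_upper_bound(P, D)
    assert 0.0 <= value <= bound + 1e-9
    print(f"P={P:4d} D={D:4d}  ||C||_* = {value:8.3f}  bound = {bound:8.3f}")

# The bound is attained when P <= D and the rows of C are orthonormal:
# every singular value is then 1, so ||C||_* = P = sqrt(P * min(P, D)).
Q, _ = np.linalg.qr(rng.standard_normal((64, 8)))  # Q: 64 x 8, orthonormal columns
C_tight = Q.T                                      # 8 x 64, orthonormal rows
print(nuclear_norm(C_tight), nuclear_norm_upper_bound(8, 64))
```

The check relies only on the standard inequality $\|C\|_* \leq \sqrt{\operatorname{rank}(C)} \, \|C\|_F$, combined with $\|C\|_F \leq \sqrt{P}$ when every row has norm at most 1.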

Figures (7)

Figure 1: Schematic of Maximum Manifold Capacity Representations (MMCR). (Left) $K \geq 2$ views of each datum are generated, then embedded by a deep neural network onto the surface of the hypersphere. (Center) For each datum, the centroid of its embeddings is computed. (Right) The MMCR pretraining loss, which is the negative nuclear norm of the centers, is then minimized.
Figure 2: Embeddings with perfect reconstruction and perfect uniformity achieve the lowest possible MMCR loss. Away from the $P=D$ threshold, uniform random vectors achieve the theoretically derived upper bound on the nuclear norm of the mean matrix $M$, i.e., the lower bound on $\mathcal{L}_{MMCR}$. The gap between the network's loss and the lowest possible $\mathcal{L}_{MMCR}$ falls (left) $\propto P^{-1}$ or (right) $\propto D^{-1}$ away from the $P=D$ threshold.
Figure 3: Double-Descent in Maximum Manifold Capacity Representations. As predicted mathematically, MMCR's pretraining percent error $\stackrel{\text{def}}{=} (\sqrt{P \min(P, D)} - \|C\|_*) / \sqrt{P \min(P, D)}$ exhibits non-monotonic double descent-like behavior, peaking when the number of data points $P$ equals the number of dimensions $D$. On either side of the $P=D$ threshold, the pretraining percent error falls. Networks are ResNet-18s pretrained on STL-10's "unlabeled" split.
Figure 4: Compute Scaling Laws. For all values of the number of points $P$ (equivalently, batch size), embedding dimension $D$, and number of views per datum $K$, the pretraining percent error falls predictably as a power law with the amount of compute, i.e., total floating point operations. Consistent with the double descent-like findings in Fig. 3, the on-diagonal subfigures (corresponding to $P = D$) exhibit higher initial pretraining percent errors and less steep slopes with compute than the off-diagonal subfigures (corresponding to $P \neq D$).
Figure 5: Multimodal MMCR on Image-Text Caption Pairs. Left: Multimodal MMCR vs Contrastive Language-Image Pretraining (CLIP) performance on ImageNet measured in zero-shot top-1 accuracy. Multimodal MMCR outperforms CLIP for smaller batch sizes but underperforms CLIP for larger batch sizes. Right: ImageNet top-1 accuracy sweep over batch sizes for MMCR. Unlike CLIP, MMCR exhibits non-monotonic performance scaling with batch size, and the best results are found at intermediate batch sizes. For validation performance to scale well, MMCR requires batch size and dimension to increase together.
...and 2 more figures
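
The pipeline described in Figure 1 and the pretraining percent error defined in Figure 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: embeddings are assumed to arrive as a `(P, K, D)` array, normalized Gaussian vectors stand in for network outputs, and the function names `view_centers`, `mmcr_loss`, and `pretraining_percent_error` are hypothetical.

```python
import numpy as np


def view_centers(embeddings):
    """Centers of the K view embeddings of each datum.

    embeddings: array of shape (P, K, D) (assumed layout): K views,
    each of dimension D, for each of P data points.
    """
    # Project every embedding onto the unit hypersphere.
    Z = embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)
    # Average over the K views of each datum -> shape (P, D).
    return Z.mean(axis=1)


def mmcr_loss(embeddings):
    """MMCR pretraining loss: negative nuclear norm of the centers."""
    C = view_centers(embeddings)
    return -float(np.linalg.svd(C, compute_uv=False).sum())


def pretraining_percent_error(embeddings):
    """(sqrt(P min(P, D)) - ||C||_*) / sqrt(P min(P, D)), as in Figure 3."""
    P, _, D = embeddings.shape
    bound = np.sqrt(P * min(P, D))
    return float((bound + mmcr_loss(embeddings)) / bound)


rng = np.random.default_rng(0)
D, K = 64, 4
errors = {}
for P in (8, 64, 512):
    # Stand-in for a trained network (an assumption of this sketch):
    # all K views of a datum share one embedding, so each center has
    # unit norm, and the directions are uniform on the hypersphere.
    base = rng.standard_normal((P, 1, D))
    embeddings = np.repeat(base, K, axis=1)
    errors[P] = pretraining_percent_error(embeddings)
    print(f"P={P:4d} D={D}  percent error = {errors[P]:.4f}")
```

With these stand-in embeddings, the printed error should be largest at $P = D$ and smaller on either side, mirroring the non-monotonic shape described in Figure 3. The sweep fixes $D$ and varies $P$; swapping the roles of the two gives the complementary view.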

Theorems & Definitions (12)

Definition 3.1: Perfect Reconstruction
Definition 3.2: Perfect Uniformity
Proposition 3.3
proof
Proposition 3.4
proof
Theorem 3.5
Proposition 4.1
Theorem 4.2
proof
...and 2 more
