Understanding Self-supervised Learning with Dual Deep Networks

Yuandong Tian, Lantao Yu, Xinlei Chen, Surya Ganguli

TL;DR

This work provides a rigorous theoretical lens for contrastive self-supervised learning with dual deep ReLU networks, showing that layerwise weight updates follow a PSD covariance operator that amplifies data-variant features surviving augmentations. By coupling this framework with hierarchical latent tree models, the authors prove that deep ReLU networks can learn latent-variable representations across layers without direct supervision, leading to emergent hierarchical features. They extend the analysis to multiple losses ($L_{simp}$, $L_{tri}^\tau$, $L_{nce}^\tau$), quantify residue terms, and validate predictions via experiments on CIFAR-10 and STL-10, including HLTM-driven synthetic data. The results offer a principled link between data augmentation, SSL dynamics, and the emergence of structured representations, with potential guidance for SSL algorithm design and interpretability. Overall, the covariance-operator viewpoint provides a unifying explanation for feature emergence in self-supervised dual-network learning.

Abstract

We propose a novel theoretical framework to understand contrastive self-supervised learning (SSL) methods that employ dual pairs of deep ReLU networks (e.g., SimCLR). First, we prove that in each SGD update of SimCLR with various loss functions, including simple contrastive loss, soft triplet loss, and InfoNCE loss, the weights at each layer are updated by a covariance operator that specifically amplifies initial random selectivities that vary across data samples but survive averages over data augmentations. To further study what role the covariance operator plays and which features are learned in such a process, we model data generation and augmentation processes through a hierarchical latent tree model (HLTM) and prove that the hidden neurons of deep ReLU networks can learn the latent variables in HLTM, despite the fact that the network receives no direct supervision from these unobserved latent variables. This leads to a provable emergence of hierarchical features through the amplification of initially random selectivities by contrastive SSL. Extensive numerical studies justify our theoretical findings. Code is released at https://github.com/facebookresearch/luckmatters/tree/master/ssl.
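The three losses named in the abstract can be compared side by side in a small sketch. The functional forms below are assumptions, since this summary does not state them: each loss is written in terms of a positive-pair distance `r_pos` and negative-pair distances `r_neg`, and the names `sq_dist`, `margin`, and the function names are illustrative rather than taken from the released code.

```python
import numpy as np

def sq_dist(f1, f2):
    """Assumed distance: half the squared l2 distance between two embeddings."""
    return 0.5 * np.sum((np.asarray(f1) - np.asarray(f2)) ** 2)

def loss_simple(r_pos, r_neg):
    """Assumed simple contrastive loss: pull positives together, push negatives apart."""
    return r_pos - r_neg

def loss_soft_triplet(r_pos, r_neg, tau=1.0, margin=0.0):
    """Assumed soft triplet loss: tau * log(1 + exp((r_pos - r_neg + margin) / tau))."""
    return tau * np.logaddexp(0.0, (r_pos - r_neg + margin) / tau)

def loss_infonce(r_pos, r_negs, tau=1.0):
    """Assumed InfoNCE loss over distances: -log softmax of the positive pair."""
    r_all = np.concatenate(([r_pos], np.atleast_1d(np.asarray(r_negs, dtype=float))))
    logits = -r_all / tau
    # log-sum-exp over all pairs minus the positive logit
    return np.logaddexp.reduce(logits) - logits[0]

if __name__ == "__main__":
    r_pos, r_neg = 0.5, 2.0
    print(loss_simple(r_pos, r_neg))
    print(loss_soft_triplet(r_pos, r_neg))
    print(loss_infonce(r_pos, [r_neg]))
```

Under these assumed forms, InfoNCE with a single negative pair and the soft triplet loss with zero margin coincide at $\tau = 1$.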

Paper Structure

This paper contains 36 sections, 23 theorems, 138 equations, 14 figures, 6 tables.

Key Result

Theorem 1

The gradient $g_{W_l}$ of $r$ w.r.t. $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$ for a single input pair $\{{\bm{x}}_1, {\bm{x}}_2\}$ is (here $K_{1,l} := K_l({\bm{x}}_1;\mathcal{W}_1)$, $K_{2,l} := K_l({\bm{x}}_2;\mathcal{W}_2)$ and $g_{W_l}:=\mathrm{vec}(\partial r/\partial W_{1,l})$):

Figures (14)

  • Figure 1: (a) Overview of the SimCLR architecture. A data point ${\bm{x}}\sim p(\cdot)$ is augmented to two views ${\bm{x}}_1,{\bm{x}}_2\sim p_\mathrm{aug}(\cdot|{\bm{x}})$, which are sent to two deep ReLU networks with identical weights $\mathcal{W}$, and their outputs are sent to a contrastive loss function. (b) Detailed notations.
  • Figure 2: To analyze the functionality of the covariance operator $\mathbb{V}_{z_0}\left[\bar{K}_l(z_0)\right]$ (Theorem \ref{thm:contrast-simclr-pairwise}), we have Assumption \ref{assumption:generative-model}: (1) data are generated from some generative model with latent variables $z_0$ and $z'$, (2) augmentation takes ${\bm{x}}(z_0,z')$, changes $z'$ but keeps $z_0$ intact.
  • Figure 3: (a) 1-layer convolutional network trained with SimCLR. (b) Its associated generative model: two different objects 11 ($z_0\!\!=\!\!1$) and 101 ($z_0\!\!=\!\!2$) undergo 1D translation. Their locations are specified by $z'$ and subject to change by data augmentation.
  • Figure 4: Hierarchical Latent Tree Models. A latent variable $z_\mu$ and its corresponding nodes $\mathcal{N}_{\mu}$ on the multi-layer ReLU side cover a subset of the input ${\bm{x}}$, resembling local receptive fields in a ConvNet.
  • Figure 5: Notation used in Theorem \ref{thm:lucky-node} and Theorem \ref{thm:cov-operator}. (a) Latent variable structure. (b) A fully connected part of HLTM. Conceptually, after SSL training, nodes (in circle) should realize the latent variables (in square) of the same color, while they never receive direct supervision from them. ${\bm{w}}_j$ is a weight vector that connects top node $j\in \mathcal{N}_{\mu}$ to all nodes in $\mathcal{N}_{\mu}^{\mathrm{ch}}$. For this FC part, we can also compute a covariance operator $\mathrm{OP}_\mu$ and Jacobian $J_\mu$. (c) ${\bm{a}}_\mu := [\rho_{\mu\nu(k)}s_k]_{k\in \mathcal{N}_{\mu}^{\mathrm{ch}}}$ is the element-wise product between selectivity and polarity of all child nodes of $z_\mu$. Its length is $|\mathcal{N}_{\mu}^{\mathrm{ch}}|$.
  • ...and 9 more figures
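The setup in the Figure 2 and Figure 3 captions can be made concrete with a toy computation of $\mathbb{V}_{z_0}\left[\bar{K}_l(z_0)\right]$. This is only a sketch under stated assumptions: the ReLU filter responses stand in for the paper's connection $K_l$, augmentation is modeled as a uniform circular shift $z'$, and `render`, `relu_features`, `augmentation_mean`, and `covariance_over_z0` are hypothetical helper names.

```python
import numpy as np

# Objects from the Figure 3 caption: "11" (z0 = 1) and "101" (z0 = 2).
PATTERNS = {1: np.array([1.0, 1.0]), 2: np.array([1.0, 0.0, 1.0])}

def render(z0, z_prime, length=8):
    """Place the object selected by z0 at circular offset z_prime."""
    x = np.zeros(length)
    pattern = PATTERNS[z0]
    idx = (z_prime + np.arange(len(pattern))) % length
    x[idx] = pattern
    return x

def relu_features(x, filters):
    """ReLU response of every filter at every circular position (stand-in for K_l)."""
    length, width = len(x), filters.shape[1]
    windows = np.stack([x[(s + np.arange(width)) % length] for s in range(length)])
    return np.maximum(windows @ filters.T, 0.0).ravel()

def augmentation_mean(z0, filters, length=8):
    """Average features over all shifts z' while keeping z0 fixed."""
    feats = [relu_features(render(z0, zp, length), filters) for zp in range(length)]
    return np.mean(feats, axis=0)

def covariance_over_z0(filters, length=8):
    """Covariance across z0 (uniform prior assumed) of the augmentation-averaged features."""
    means = np.stack([augmentation_mean(z0, filters, length) for z0 in sorted(PATTERNS)])
    centered = means - means.mean(axis=0)
    return centered.T @ centered / len(means)

rng = np.random.default_rng(0)
filters = rng.standard_normal((4, 3))  # random initial filters (illustrative sizes)
cov = covariance_over_z0(filters)

if __name__ == "__main__":
    print(cov.shape, np.linalg.eigvalsh(cov).max())
```

In this toy setting the matrix is PSD by construction, and any feature direction whose augmentation-averaged response is the same for both objects has zero variance, so it would not be amplified under the covariance-operator picture.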

Theorems & Definitions (41)

  • Definition 1: The connection $K_l({\bm{x}})$
  • Theorem 1: Squared $\ell_2$ Gradient for dual deep ReLU networks
  • Theorem 2: Common Property of Contrastive Losses
  • Theorem 3: Covariance Operator for $L_{\mathrm{simp}}$
  • Theorem 4: Covariance Operator for $L_{\mathrm{tri}}^\tau$ and $L_{\mathrm{nce}}^\tau$ ($H = 1$, single negative pair)
  • Corollary 1
  • Theorem 5: Theorem Sketch, Lucky node at initialization for SB-HLTM
  • Theorem 6: Activation covariance in SB-HLTM
  • Definition 2: reversibility
  • Lemma 1: Recursive Gradient Update (Extension to Lemma 1 in tian2019student)
  • ...and 31 more
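For readers unfamiliar with latent tree models, the following sketch samples leaves from a symmetric binary tree in the spirit of the SB-HLTM named in Theorems 5 and 6. The transition rule is an assumption, not taken from this summary: each child copies its binary parent with probability $(1+\rho)/2$ for a polarity $\rho \in [-1, 1]$. Augmentation is modeled, following the Figure 2 caption, as resampling everything below the root $z_0$. All function names are hypothetical.

```python
import numpy as np

def sample_children(parent, rho, rng, n_children=2):
    """Each child copies the binary parent with probability (1 + rho) / 2 (assumed rule)."""
    keep = rng.random(n_children) < (1.0 + rho) / 2.0
    return np.where(keep, parent, 1 - parent)

def sample_leaves(z0, depth, rho, rng):
    """Sample the tree top-down from root value z0 and return the leaf values."""
    level = np.array([z0])
    for _ in range(depth):
        level = np.concatenate([sample_children(z, rho, rng) for z in level])
    return level

def augmented_pair(z0, depth, rho, rng):
    """Two views sharing the root z0 but with independently resampled descendants."""
    return sample_leaves(z0, depth, rho, rng), sample_leaves(z0, depth, rho, rng)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x1, x2 = augmented_pair(z0=1, depth=3, rho=0.8, rng=rng)
    print(x1, x2)
```

With `rho = 1` every leaf equals the root, and as `rho` moves toward 0 the leaves carry progressively less information about `z0`.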