Understanding Self-supervised Learning with Dual Deep Networks
Yuandong Tian, Lantao Yu, Xinlei Chen, Surya Ganguli
TL;DR
This work provides a rigorous theoretical lens for contrastive self-supervised learning (SSL) with dual deep ReLU networks, showing that layerwise weight updates are governed by a positive semi-definite (PSD) covariance operator that amplifies features which vary across data samples yet survive averaging over augmentations. By coupling this framework with hierarchical latent tree models (HLTM), the authors prove that deep ReLU networks can learn latent-variable representations across layers without direct supervision, leading to emergent hierarchical features. They extend the analysis to multiple losses ($L_{simp}$, $L_{tri}^\tau$, $L_{nce}^\tau$), quantify residue terms, and validate predictions via experiments on CIFAR-10 and STL-10, as well as on synthetic data generated from an HLTM. The results offer a principled link between data augmentation, SSL dynamics, and the emergence of structured representations, with potential guidance for SSL algorithm design and interpretability. Overall, the covariance-operator viewpoint provides a unifying explanation for feature emergence in self-supervised dual-network learning.
Abstract
We propose a novel theoretical framework to understand contrastive self-supervised learning (SSL) methods that employ dual pairs of deep ReLU networks (e.g., SimCLR). First, we prove that in each SGD update of SimCLR with various loss functions, including simple contrastive loss, soft triplet loss, and InfoNCE loss, the weights at each layer are updated by a \emph{covariance operator} that specifically amplifies initial random selectivities that vary across data samples but survive averages over data augmentations. To further study what role the covariance operator plays and which features are learned in such a process, we model data generation and augmentation processes through a \emph{hierarchical latent tree model} (HLTM) and prove that the hidden neurons of deep ReLU networks can learn the latent variables in HLTM, despite the fact that the network receives \emph{no direct supervision} from these unobserved latent variables. This leads to a provable emergence of hierarchical features via the amplification of initially random selectivities under contrastive SSL. Extensive numerical studies justify our theoretical findings. Code is available at https://github.com/facebookresearch/luckmatters/tree/master/ssl.
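The amplification described in the abstract can be illustrated with a minimal sketch. It is not the paper's deep ReLU setting: it assumes a single linear layer, a toy two-dimensional dataset, and a simplified update of the form $w \leftarrow w + \alpha\,\mathrm{OP}\,w$, where $\mathrm{OP}$ is estimated as the covariance across samples of the augmentation-averaged input. The function names, the toy augmentation, and the step size are all hypothetical choices made for illustration.

```python
import numpy as np


def resample_second_coordinate(x, rng):
    """Toy augmentation (assumed): keep coordinate 0, redraw coordinate 1."""
    out = x.copy()
    out[1] = rng.normal()
    return out


def augmentation_averaged_covariance(data, augment, rng, n_aug=64):
    """Estimate the covariance over samples of the augmentation-averaged input.

    For each sample, average n_aug augmented views, then take the covariance
    of these per-sample averages across the dataset. The result is PSD.
    """
    averaged = np.stack([
        np.mean([augment(x, rng) for _ in range(n_aug)], axis=0)
        for x in data
    ])
    centered = averaged - averaged.mean(axis=0, keepdims=True)
    return centered.T @ centered / len(data)


rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))  # both coordinates vary across samples

# Coordinate 0 survives augmentation; coordinate 1 is averaged away.
op = augmentation_averaged_covariance(data, resample_second_coordinate, rng)

# Simplified linear dynamics (an assumption of this sketch): w <- w + lr * op @ w.
w = np.ones(2) / np.sqrt(2.0)
for _ in range(50):
    w = w + 0.1 * (op @ w)

print("operator diagonal:", np.diag(op))
print("normalized weight:", w / np.linalg.norm(w))
```

In this toy setup the operator's diagonal entry for coordinate 0 is much larger than the entry for coordinate 1, so repeated updates rotate the weight vector toward the coordinate that varies across samples and is preserved by the augmentation.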
