Dual Perspectives on Non-Contrastive Self-Supervised Learning
Jean Ponce, Basile Terver, Martial Hebert, Michael Arbel
TL;DR
This paper examines the stop-gradient (SG) and exponential moving average (EMA) procedures used to prevent collapse in non-contrastive self-supervised learning. It proves that SG and EMA do not minimize the original objective $\bar{E}$ and that, for squared Euclidean losses with regularization, they do not correspond to the optimization of any well-defined function. In the linear setting, the associated dynamical systems have equilibria that form algebraic varieties and are generally asymptotically stable, which explains why these methods avoid degenerate solutions even though they do not follow classical gradient descent. Empirical results on real and synthetic data corroborate that SG/EMA do not converge to a minimizer of $\bar{E}$ but still yield useful representations with early-stage improvements in downstream tasks, highlighting a nuanced separation between optimization and learned representations.
Abstract
The {\em stop gradient} and {\em exponential moving average} iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they {\em do not} optimize the original objective, or {\em any} other smooth function, they {\em do} avoid collapse. Following~\citet{Tian21}, but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average {\em always} leads to collapse. Conversely, we characterize explicitly the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, {\em asymptotically stable}. Our theoretical findings are illustrated by empirical experiments with real and synthetic data.
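
The claim that the stop-gradient update does not optimize any smooth function can be illustrated with a minimal sketch. It is a toy scalar model chosen here for illustration, not the paper's setting. The encoder weight `w`, predictor weight `p`, the loss $E(w,p) = \frac{1}{2}\,\mathrm{mean}\big((p\,w\,x_1 - w\,x_2)^2\big)$, the data values, and all function names are assumptions. The check rests on a standard fact: a smooth vector field can be the gradient of a function only if its Jacobian is symmetric.

```python
# Toy scalar model (an illustrative assumption, not the paper's setup):
# encoder weight w, predictor weight p, two views x1, x2 of each sample.
# Loss: E(w, p) = 0.5 * mean((p*w*x1 - w*x2)**2)

X1 = [1.0, -0.5, 2.0, 0.3]   # first views (made-up numbers)
X2 = [0.8, -0.7, 2.2, 0.1]   # second views (made-up numbers)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def full_gradient(w, p):
    """Exact gradient of E with respect to (w, p)."""
    g_w = mean((p * w * a - w * b) * (p * a - b) for a, b in zip(X1, X2))
    g_p = mean((p * w * a - w * b) * (w * a) for a, b in zip(X1, X2))
    return g_w, g_p


def stop_gradient_field(w, p):
    """Update direction when the target branch w*x2 is held constant."""
    # Only the online branch p*w*x1 contributes to the derivative in w.
    g_w = mean((p * w * a - w * b) * (p * a) for a, b in zip(X1, X2))
    g_p = mean((p * w * a - w * b) * (w * a) for a, b in zip(X1, X2))
    return g_w, g_p


def jacobian_asymmetry(field, w, p, h=1e-5):
    """Central-difference estimate of d(field_w)/dp - d(field_p)/dw."""
    d_fw_dp = (field(w, p + h)[0] - field(w, p - h)[0]) / (2 * h)
    d_fp_dw = (field(w + h, p)[1] - field(w - h, p)[1]) / (2 * h)
    return d_fw_dp - d_fp_dw


# Zero (up to rounding) for a true gradient field, nonzero otherwise.
asym_full = jacobian_asymmetry(full_gradient, 0.7, 1.3)
asym_sg = jacobian_asymmetry(stop_gradient_field, 0.7, 1.3)
print("full gradient asymmetry:", asym_full)
print("stop-gradient asymmetry:", asym_sg)
```

In this toy model the asymmetry of the stop-gradient field works out to `w * mean(x1 * x2)`, which is nonzero for generic data, so the field cannot be the gradient of any smooth function of `(w, p)`. Adding a weight-decay term proportional to `(w, p)` would contribute only a symmetric part to the Jacobian and leave this asymmetry unchanged.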
