Dual Perspectives on Non-Contrastive Self-Supervised Learning

Jean Ponce; Basile Terver; Martial Hebert; Michael Arbel

TL;DR

This paper interrogates stop-gradient (SG) and exponential moving average (EMA) procedures used to prevent collapse in non-contrastive self-supervised learning. It proves that SG and EMA do not minimize the original objective $\bar{E}$, and for squared Euclidean losses with regularization they do not correspond to the optimization of any well-defined function. In the linear setting, the associated dynamical systems have equilibria that form algebraic varieties and are generally asymptotically stable, explaining why these methods avoid degenerate solutions while not following classical gradient descent. Empirical results on real and synthetic data corroborate that SG/EMA do not converge to a minimizer of $\bar{E}$ but still yield useful representations with early-stage improvements in downstream tasks, highlighting a nuanced separation between optimization and learned representations.

Abstract

The {\em stop gradient} and {\em exponential moving average} iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they {\em do not} optimize the original objective, or {\em any} other smooth function, they {\em do} avoid collapse. Following~\citet{Tian21}, but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average {\em always} leads to collapse. Conversely, we characterize explicitly the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, {\em asymptotically stable}. Our theoretical findings are illustrated by empirical experiments with real and synthetic data.
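
To make the claim that stop gradient need not optimize any smooth function more concrete, here is a minimal toy sketch. It is not taken from the paper. It assumes a scalar linear encoder `w`, a scalar predictor `p`, two scalar views `x1` and `x2`, and the illustrative loss ½(p·w·x1 − w·x2)², with no regularization; all of these names and the loss are assumptions made for illustration only. The gradient of any twice-differentiable function has a symmetric Jacobian, so a nonzero gap between the two mixed partial derivatives shows that an update field cannot be the gradient of such a function.

```python
# Toy illustration (assumed setup, not the paper's objective):
# scalar encoder w, scalar predictor p, views x1 and x2.

def residual(w, p, x1, x2):
    # Prediction from view 1 minus the target built from view 2.
    return p * w * x1 - w * x2

def full_grad(w, p, x1, x2):
    # True gradient of 0.5 * residual**2 with respect to (w, p):
    # the target w * x2 is differentiated through.
    r = residual(w, p, x1, x2)
    return (r * (p * x1 - x2), r * w * x1)

def sg_field(w, p, x1, x2):
    # Stop-gradient update field: the target w * x2 is treated as a
    # constant, so its contribution to the w-derivative is dropped.
    r = residual(w, p, x1, x2)
    return (r * p * x1, r * w * x1)

def jacobian_asymmetry(field, w, p, x1, x2, h=1e-5):
    # d(field_w)/dp - d(field_p)/dw, estimated by central differences.
    # This is zero for the gradient of any twice-differentiable function.
    d_fw_dp = (field(w, p + h, x1, x2)[0] - field(w, p - h, x1, x2)[0]) / (2 * h)
    d_fp_dw = (field(w + h, p, x1, x2)[1] - field(w - h, p, x1, x2)[1]) / (2 * h)
    return d_fw_dp - d_fp_dw

if __name__ == "__main__":
    w, p, x1, x2 = 0.5, 2.0, 1.0, 1.5
    print("full gradient asymmetry:", jacobian_asymmetry(full_grad, w, p, x1, x2))
    print("stop-gradient asymmetry:", jacobian_asymmetry(sg_field, w, p, x1, x2))
```

In this toy setting the asymmetry of the stop-gradient field works out analytically to x1·w·x2, which is nonzero whenever both views and the encoder weight are nonzero, while the true gradient has zero asymmetry. Note also that w = 0 drives the assumed loss to zero, which is the collapsed solution in this toy model.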
