Dual Perspectives on Non-Contrastive Self-Supervised Learning

Jean Ponce, Basile Terver, Martial Hebert, Michael Arbel

TL;DR

This paper interrogates stop-gradient (SG) and exponential moving average (EMA) procedures used to prevent collapse in non-contrastive self-supervised learning. It proves that SG and EMA do not minimize the original objective $\bar{E}$, and for squared Euclidean losses with regularization they do not correspond to the optimization of any well-defined function. In the linear setting, the associated dynamical systems have equilibria that form algebraic varieties and are generally asymptotically stable, explaining why these methods avoid degenerate solutions while not following classical gradient descent. Empirical results on real and synthetic data corroborate that SG/EMA do not converge to a minimizer of $\bar{E}$ but still yield useful representations with early-stage improvements in downstream tasks, highlighting a nuanced separation between optimization and learned representations.

Abstract

The {\em stop gradient} and {\em exponential moving average} iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they {\em do not} optimize the original objective, or {\em any} other smooth function, they {\em do} avoid collapse. Following~\citet{Tian21}, but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average {\em always} leads to collapse. Conversely, we characterize explicitly the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, {\em asymptotically stable}. Our theoretical findings are illustrated by empirical experiments with real and synthetic data.
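
To make the distinction between the full gradient and the stop-gradient update concrete, here is a minimal NumPy sketch in a linear setting. The exact objective used in the paper is not reproduced on this page, so the loss below, $\frac{1}{2}\,\mathrm{mean}_i \|B A x_{1,i} - A x_{2,i}\|^2$ with a linear encoder `A` and linear predictor `B`, is an assumption chosen for illustration; the names `loss`, `grads`, `x1`, `x2`, `A`, and `B` are hypothetical and do not come from the paper. The exponential moving average variant is not shown.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): linear encoder A (k x d),
# linear predictor B (k x k), and two noisy views x1, x2 of the same samples.
rng = np.random.default_rng(0)
n, d, k = 256, 6, 3
x = rng.normal(size=(n, d))
x1 = x + 0.1 * rng.normal(size=(n, d))
x2 = x + 0.1 * rng.normal(size=(n, d))


def loss(A, B):
    """Assumed objective: 0.5 * mean_i ||B A x1_i - A x2_i||^2."""
    r = x1 @ A.T @ B.T - x2 @ A.T
    return 0.5 * np.mean(np.sum(r ** 2, axis=1))


def grads(A, B, stop_gradient):
    """Update directions for (A, B).

    With stop_gradient=True the target branch A x2 is treated as a
    constant when differentiating with respect to A.
    """
    z1 = x1 @ A.T                  # online-branch embeddings, shape (n, k)
    r = z1 @ B.T - x2 @ A.T        # residuals, shape (n, k)
    g_B = r.T @ z1 / n             # derivative with respect to the predictor
    g_A = B.T @ r.T @ x1 / n       # encoder term from the online branch
    if not stop_gradient:
        g_A = g_A - r.T @ x2 / n   # encoder term from the target branch
    return g_A, g_B


A = rng.normal(size=(k, d))
B = rng.normal(size=(k, k))
g_full, _ = grads(A, B, stop_gradient=False)
g_sg, _ = grads(A, B, stop_gradient=True)

# The collapsed encoder A = 0 attains the value 0 of the assumed loss.
collapsed_loss = loss(np.zeros((k, d)), B)

# Size of the term that the stop gradient drops from the encoder update.
sg_gap = np.linalg.norm(g_sg - g_full)
print(collapsed_loss, sg_gap)
```

In this sketch `g_full` is the true gradient of the assumed loss with respect to `A`, while `g_sg` omits the target-branch term, so the stop-gradient update is not gradient descent on that loss. The collapsed encoder `A = 0` drives the assumed loss to zero, which is the degenerate solution the abstract refers to.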

Paper Structure

This paper contains 5 sections, 2 equations, 1 figure.

Figures (1)

  • Figure 1: A (toy) illustration of the optimization landscape for the objective function $\bar{E}(\theta,\psi)$. Here $C$ is the global minimum of $\bar{E}(\theta,\psi)$ (shown as negative instead of zero for readability) associated with a collapse of the training process; $B$ is a nontrivial local minimum one may reach using an appropriate regularization to avoid collapse; and $A$ is a limit point of the stop gradient (SG) training procedure associated with parameters $\theta^*$ and $\psi^*$ at convergence. In general, it is not a minimum of $\bar{E}$ and thus does not correspond to a collapse of the training process, but it is a minimum with respect to $\psi$ of $\bar{E}(\theta^*,\psi)$. See text for details.