BYOL works even without batch statistics

Pierre H. Richemond; Jean-Bastien Grill; Florent Altché; Corentin Tallec; Florian Strub; Andrew Brock; Samuel Smith; Soham De; Razvan Pascanu; Bilal Piot; Michal Valko

TL;DR

This work questions whether batch statistics are necessary in BYOL, a non-contrastive self-supervised method. Through BN ablations, BN-free training that relies on initialization, and replacement of BN with group normalization (GN) plus weight standardization (WS), it shows that BN is not strictly required to avoid collapse or to learn useful representations. On ImageNet with ResNet-50, a BN-free BYOL with data-dependent affine initialization reaches 65.7% top-1 accuracy, while the GN+WS variant reaches 73.9%, close to vanilla BYOL's 74.3%. The findings point to initialization and batch-independent normalization as viable design choices for robust self-supervised learning without BN.
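
To make "batch-independent normalization" concrete, here is a minimal pure-Python sketch, not the authors' implementation. It normalizes one fixed example alongside two different batch mates and checks whether its output changes. The function names, the toy 4-channel inputs, the group count, and the placement of `eps` are all assumptions made for illustration.

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Normalize one example's channels within groups of channels.

    x holds the activations of a single example, so no statistic is
    shared with other examples in the batch.
    """
    assert len(x) % num_groups == 0
    size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        chunk = x[g * size:(g + 1) * size]
        mean = sum(chunk) / size
        var = sum((v - mean) ** 2 for v in chunk) / size
        out.extend((v - mean) / math.sqrt(var + eps) for v in chunk)
    return out

def batch_norm(batch, eps=1e-5):
    """Normalize each channel with statistics computed across the batch."""
    n, channels = len(batch), len(batch[0])
    out = [[0.0] * channels for _ in range(n)]
    for c in range(channels):
        col = [row[c] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][c] = (batch[i][c] - mean) / math.sqrt(var + eps)
    return out

def standardize_weights(w, eps=1e-5):
    """Weight standardization: zero mean, unit variance per output unit (row)."""
    out = []
    for row in w:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

# The same example placed in two batches with different companions.
sample = [1.0, 2.0, 3.0, 4.0]
batch_a = [sample, [0.0, 0.0, 0.0, 0.0]]
batch_b = [sample, [9.0, -5.0, 2.0, 7.0]]

gn_a = [group_norm(row, num_groups=2) for row in batch_a]
gn_b = [group_norm(row, num_groups=2) for row in batch_b]
bn_a = batch_norm(batch_a)
bn_b = batch_norm(batch_b)

gn_unchanged = gn_a[0] == gn_b[0]  # group norm ignores the batch mate
bn_unchanged = bn_a[0] == bn_b[0]  # batch norm does not
print(gn_unchanged, bn_unchanged)
```

Under group normalization the example's output is identical in both batches, whereas under batch normalization it depends on the other example. Weight standardization acts on the weights alone, so it cannot introduce any dependence on the batch.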

Abstract

Bootstrap Your Own Latent (BYOL) is a self-supervised learning approach for image representation. From an augmented view of an image, BYOL trains an online network to predict a target network representation of a different augmented view of the same image. Unlike contrastive methods, BYOL does not explicitly use a repulsion term built from negative pairs in its training objective. Yet, it avoids collapse to a trivial, constant representation. Thus, it has recently been hypothesized that batch normalization (BN) is critical to prevent collapse in BYOL. Indeed, BN flows gradients across batch elements, and could leak information about negative views in the batch, which could act as an implicit negative (contrastive) term. However, we experimentally show that replacing BN with a batch-independent normalization scheme (namely, a combination of group normalization and weight standardization) achieves performance comparable to vanilla BYOL ($73.9\%$ vs. $74.3\%$ top-1 accuracy under the linear evaluation protocol on ImageNet with ResNet-$50$). Our finding disproves the hypothesis that the use of batch statistics is a crucial ingredient for BYOL to learn useful representations.
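
The collapse issue can be illustrated with a small sketch of the BYOL regression objective, which is the squared distance between the L2-normalized online prediction and the L2-normalized target projection (equivalently, 2 minus twice their cosine similarity). This is a simplified, assumption-laden illustration in pure Python: the vectors are hand-picked, the helper names are hypothetical, and the stop-gradient on the target and the network architecture are omitted.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale v to (approximately) unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def byol_loss(online_prediction, target_projection):
    """Squared L2 distance between unit-normalized vectors (2 - 2 * cosine)."""
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(p, z))

# Prediction and target point in the same direction: loss near 0.
aligned = byol_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Prediction and target are orthogonal: loss near 2.
orthogonal = byol_loss([1.0, 0.0], [0.0, 1.0])

# Collapse: both networks emit the same constant vector for every image.
# The objective has no negative-pair term, so this trivial solution
# attains the minimum loss.
constant = [0.5, 0.5, 0.5]
collapsed = sum(byol_loss(constant, constant) for _ in range(4)) / 4

print(aligned, orthogonal, collapsed)
```

Because a constant representation minimizes this objective, something else in the training setup must prevent collapse. This is why BN's cross-batch interaction was hypothesized to act as an implicit contrastive term.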
