Batchless Normalization: How to Normalize Activations Across Instances with Minimal Memory Requirements

Benjamin Berger; Victor Uc Cetina

TL;DR

Batchless Normalization removes the dependence on batch statistics by learning the activation statistics directly: activations are modeled as $\mathcal{N}(\mu,\sigma^2)$, and the corresponding negative log likelihood is added to the loss. Activations are normalized via $a_{out} = (a_{in}-\mu)/\sigma \cdot \gamma + \beta$, with optional cross-channel sharing of $\mu$ and $\sigma$, and a stop-gradient mechanism keeps the learning of the statistics decoupled from upstream gradients. A gauge term $g$ is subtracted from the gauged loss to align the objective with the true distribution. Initialization, migration, and implementation details are provided to support practical adoption. Experiments on synthetic data and CIFAR-10 indicate that Batchless Normalization achieves competitive convergence and greater stability at small batch sizes. This suggests memory and communication advantages without sacrificing performance, and may broaden access to large-model training on limited hardware.
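For reference, the negative log likelihood of a single activation under the Gaussian model can be written out explicitly. The expression below is the standard Gaussian density, not a formula quoted from the paper, and it leaves out any weighting or gauging the method may apply:

$$-\log p(a_{in}\mid\mu,\sigma) = \log\sigma + \frac{(a_{in}-\mu)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi)$$

Setting the expected derivatives with respect to $\mu$ and $\sigma$ to zero gives $\mu = \mathbb{E}[a_{in}]$ and $\sigma^2 = \mathbb{E}[(a_{in}-\mu)^2]$. Minimizing this term by gradient descent therefore drives $\mu$ and $\sigma$ toward the mean and standard deviation of the activation across instances, which are the quantities batch normalization estimates from each batch.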

Abstract

In training neural networks, batch normalization has many benefits, not all of them entirely understood. But it also has some drawbacks. Foremost is arguably memory consumption, as computing the batch statistics requires all instances within the batch to be processed simultaneously, whereas without batch normalization it would be possible to process them one by one while accumulating the weight gradients. Another drawback is that the distribution parameters (mean and standard deviation) are unlike all other model parameters in that they are not trained using gradient descent but require special treatment, complicating implementation. In this paper, I show a simple and straightforward way to address these issues. The idea, in short, is to add terms to the loss that, for each activation, cause the minimization of the negative log likelihood of a Gaussian distribution that is used to normalize the activation. Among other benefits, this will hopefully contribute to the democratization of AI research by means of lowering the hardware requirements for training larger models.
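The idea can be illustrated with a minimal sketch for a single scalar activation, processed one instance at a time. This is not the authors' implementation. The function names `gaussian_nll`, `normalize`, and `fit_statistics` are hypothetical, the gradients are written by hand instead of using an autodiff framework, and parameterizing the standard deviation through its logarithm is an assumption made here only to keep $\sigma$ positive.

```python
import math
import random
import statistics


def gaussian_nll(a, mu, log_sigma):
    """NLL of activation a under N(mu, sigma^2), with sigma = exp(log_sigma)."""
    z = (a - mu) / math.exp(log_sigma)
    return log_sigma + 0.5 * z * z + 0.5 * math.log(2.0 * math.pi)


def normalize(a, mu, log_sigma, gamma=1.0, beta=0.0):
    """a_out = (a_in - mu) / sigma * gamma + beta, using the learned statistics."""
    return (a - mu) / math.exp(log_sigma) * gamma + beta


def fit_statistics(activations, lr=0.002, epochs=20, seed=0):
    """Learn mu and log_sigma by SGD on the NLL, one instance per step."""
    rng = random.Random(seed)
    data = list(activations)
    mu, log_sigma = 0.0, 0.0  # assumed starting point: a standard normal
    for _ in range(epochs):
        rng.shuffle(data)
        for a in data:
            # a is treated as a constant: the NLL gradient only updates the
            # statistics, mimicking a stop-gradient on the activation.
            sigma = math.exp(log_sigma)
            z = (a - mu) / sigma
            mu -= lr * (-z / sigma)          # d NLL / d mu
            log_sigma -= lr * (1.0 - z * z)  # d NLL / d log_sigma
    return mu, log_sigma


if __name__ == "__main__":
    rng = random.Random(1)
    acts = [rng.gauss(3.0, 2.0) for _ in range(2000)]
    mu, log_sigma = fit_statistics(acts)
    print("learned mu, sigma:", mu, math.exp(log_sigma))
    print("sample mean, std: ", statistics.mean(acts), statistics.pstdev(acts))
```

Because each update uses a single instance, no batch has to be held in memory to estimate the statistics. In a real network the same NLL term would be added to the training loss for every normalized activation and differentiated automatically, with the framework's stop-gradient operation applied to the activation.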

Paper Structure (18 sections, 4 equations, 1 figure, 5 tables)

Introduction
Batchless normalization
Gauge
How this addresses the shortcomings of batch normalization
Initialization
Possible problems with batchless normalization
Migrating to batchless normalization
Implementation
Experiments
Convergence and Stability
Data set
Training and evaluation strategy
Results
CIFAR-10 classification
Conclusion and Further work
...and 3 more sections

Figures (1)

Figure 1: Structure of the example problem. Not all data points are shown. Background color indicates the association from coordinates to class label as learned by one neural network instance.
