Batchless Normalization: How to Normalize Activations Across Instances with Minimal Memory Requirements
Benjamin Berger, Victor Uc Cetina
TL;DR
Batchless Normalization removes the dependence on batch statistics by learning activation statistics directly: each activation is modeled as $\mathcal{N}(\mu,\sigma^2)$, and the corresponding negative log likelihood is added to the loss. Activations are normalized via $a_{out} = (a_{in}-\mu)/\sigma \cdot \gamma + \beta$, with optional sharing of $\mu$ and $\sigma$ across channels, and a stop-gradient mechanism keeps the learning of the statistics decoupled from upstream gradients. A gauge term $g$ is subtracted from the gauged loss to align the objective with the true distribution. Initialization, migration, and implementation details are provided to support practical adoption. Experiments on synthetic data and CIFAR-10 indicate that Batchless Normalization converges competitively and is more stable at small batch sizes. This suggests memory and communication advantages without sacrificing performance, and may make large-model training more accessible on limited hardware.
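For reference, the negative log likelihood of a single activation $a_{in}$ under $\mathcal{N}(\mu,\sigma^2)$ takes the standard form below. This is the textbook Gaussian expression, not a quotation of the paper's loss; how the term is weighted and how it combines with the gauge term $g$ is left to the main text.

$$ -\log p(a_{in} \mid \mu, \sigma) = \log\sigma + \frac{(a_{in}-\mu)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi) $$

Minimizing this expression in $\mu$ and $\sigma$ over many instances drives them toward the mean and standard deviation of the activation.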
Abstract
In training neural networks, batch normalization has many benefits, not all of them entirely understood, but it also has drawbacks. Foremost is arguably memory consumption: computing the batch statistics requires all instances within the batch to be processed simultaneously, whereas without batch normalization they could be processed one by one while accumulating the weight gradients. Another drawback is that the distribution parameters (mean and standard deviation) are unlike all other model parameters in that they are not trained using gradient descent but require special treatment, which complicates implementation. In this paper, I show a simple and straightforward way to address these issues. The idea, in short, is to add terms to the loss that, for each activation, cause the minimization of the negative log likelihood of a Gaussian distribution that is used to normalize the activation. Among other benefits, this will hopefully contribute to the democratization of AI research by lowering the hardware requirements for training larger models.
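The idea described above can be illustrated with a minimal pure-Python sketch for a single scalar activation. It is not the paper's implementation: the function names (`gaussian_nll`, `nll_grads`, `fit_statistics`, `normalize`), the learning rate, the epoch count, and the lower clamp on `sigma` are all assumptions made for illustration, and the gauge term and the actual network are omitted. Treating each incoming activation as a constant stands in for the stop-gradient mechanism.

```python
import math


def gaussian_nll(a, mu, sigma):
    """Negative log likelihood of activation a under N(mu, sigma^2)."""
    return (math.log(sigma)
            + (a - mu) ** 2 / (2.0 * sigma ** 2)
            + 0.5 * math.log(2.0 * math.pi))


def nll_grads(a, mu, sigma):
    """Gradients of gaussian_nll with respect to mu and sigma.

    The activation a is treated as a constant, which mimics a stop-gradient:
    fitting the statistics does not push gradients into upstream layers.
    """
    d_mu = -(a - mu) / sigma ** 2
    d_sigma = 1.0 / sigma - (a - mu) ** 2 / sigma ** 3
    return d_mu, d_sigma


def fit_statistics(activations, mu=0.0, sigma=1.0, lr=0.01, epochs=2000):
    """Learn mu and sigma by plain gradient descent, one instance at a time.

    No batch statistics are computed, so only a single activation needs to
    be held in memory per step. lr, epochs and the clamp are illustrative.
    """
    for _ in range(epochs):
        for a in activations:
            d_mu, d_sigma = nll_grads(a, mu, sigma)
            mu -= lr * d_mu
            sigma = max(sigma - lr * d_sigma, 1e-3)  # keep sigma positive
    return mu, sigma


def normalize(a_in, mu, sigma, gamma=1.0, beta=0.0):
    """Normalize with the learned statistics, then scale and shift."""
    return (a_in - mu) / sigma * gamma + beta


if __name__ == "__main__":
    data = [3.0, 5.0, 7.0, 5.0, 4.0, 6.0]  # toy activations, mean 5
    mu, sigma = fit_statistics(data)
    print("learned mu, sigma:", mu, sigma)
    print("normalized:", [normalize(a, mu, sigma) for a in data])
```

On this toy data the learned `mu` and `sigma` should settle close to the sample mean and standard deviation, with small fluctuations because the updates are made per instance with a constant step size.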
