Streaming Normalization: Towards Simpler and More Biologically-plausible Normalizations for Online and Recurrent Learning

Qianli Liao, Kenji Kawaguchi, Tomaso Poggio

TL;DR

This work addresses the limitations of Batch Normalization (BN) for online and recurrent learning by introducing Streaming Normalization, a unifying framework that collects normalization statistics online and uses streaming gradients to update parameters without requiring full backpropagation over history. It generalizes Layer Normalization (LN) and BN through Sample Normalization, General Batch Normalization, and Streaming Normalization, and improves suitability for online learning with Decoupled Accumulation and Update (DAU) and Lp normalization (notably L1). The approach is extended to recurrent networks via RGBN and RSN, and is reported to converge faster and perform robustly on CIFAR-10 and character-level language modeling, with implications for hardware efficiency and biological plausibility. Overall, Streaming Normalization offers a flexible normalization paradigm that compares favorably with existing methods in online, small-batch, and recurrent settings, while aligning more closely with biological processing principles.

Abstract

We systematically explore a spectrum of normalization algorithms related to Batch Normalization (BN) and propose a generalized formulation that simultaneously solves two major limitations of BN: (1) online learning and (2) recurrent learning. Our proposal is simpler and more biologically-plausible. Unlike previous approaches, our technique can be applied out of the box to all learning scenarios (e.g., online learning, batch learning, fully-connected, convolutional, feedforward, recurrent, and mixed recurrent and convolutional networks) and compares favorably with existing approaches. We also propose Lp Normalization for normalizing by different orders of statistical moments. In particular, L1 normalization performs well, is simple to implement, fast to compute, and more biologically-plausible, and is thus well suited to GPU or hardware implementations.
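
As an illustration of normalizing by different orders of statistical moments, the following is a minimal sketch and not the authors' implementation. It assumes the Lp statistic is the p-th root of the mean absolute deviation raised to the power p, computed per feature over the batch axis. The function name `lp_normalize` and the `eps` parameter are hypothetical choices made for this example.

```python
import numpy as np

def lp_normalize(x, p=1, eps=1e-5):
    """Center each feature (column) and divide by an Lp-style spread.

    Assumption for this sketch: spread = (mean(|x - mean|**p))**(1/p),
    taken over the batch axis. With p=2 this is the ordinary
    (population) standard deviation; with p=1 it is the mean absolute
    deviation, which needs no squaring or square root.
    """
    x = np.asarray(x, dtype=np.float64)
    mu = x.mean(axis=0, keepdims=True)           # per-feature mean
    centered = x - mu
    spread = np.mean(np.abs(centered) ** p, axis=0, keepdims=True) ** (1.0 / p)
    return centered / (spread + eps)             # eps guards against division by zero

# Example: a batch of 64 samples with 3 features.
rng = np.random.default_rng(0)
batch = rng.normal(loc=2.0, scale=3.0, size=(64, 3))
out_l1 = lp_normalize(batch, p=1)
out_l2 = lp_normalize(batch, p=2)
```

Under these assumptions, the p=1 output has a mean absolute value close to 1 per feature, and the p=2 output has a standard deviation close to 1, so the two variants differ only in how the scale of the activations is measured.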

Paper Structure

This paper contains 27 sections, 4 equations, 13 figures, 1 table, 6 algorithms.

Figures (13)

  • Figure 1: A General Framework of Normalization. A: the input to a convolutional layer is a 3D matrix with three dimensions: x (image width), y (image height) and features/channels. For fully-connected layers, x=y=1. B: training with decoupled accumulation and update. C: Sample Normalization. D: General Batch Normalization. E: Streaming Normalization.
  • Figure 2: Architectures for CIFAR-10. Note that C reduces to B when $k_1=k_2=1$.
  • Figure 3: Lp Normalization. The architecture is a feedforward and convolutional network (shown in Figure 2 B). The tested orders of statistical moments perform similarly well, except that L7 normalization is slightly worse.
  • Figure 4: Plain Mini-batch vs. Decoupled Accumulation and Update (DAU). The architecture is a feedforward and fully-connected network (shown in Figure 2 A). S/B: Samples per Batch. B/U: Batches per Weight Update. We show there are significant performance differences between plain mini-batch (i.e., B/U=1) and Decoupled Accumulation and Update (DAU, i.e., B/U=n>1). DAU significantly improves the performance of BN with a small number of samples per mini-batch (e.g., compare curve 1 with 3).
  • Figure 5: Different normalizations applied to a feedforward and fully-connected network (shown in Figure 2 A). The right two panels are zoomed-in versions of the left two panels. S/B: Samples per Batch. B/U: Batches per Weight Update. "Ours" refers to Streaming Normalization with "L1 norm" (Setting B with p=1 in the section on Lp Normalization) and $\alpha_1=\beta_1=0.7$, $\alpha_2=\beta_2=0.3$ and $\beta_3=0$ (see the section on hyperparameters for more details). We show that our algorithm works with pure online learning (1 S/B) and tiny mini-batches (2 S/B), and that it outperforms Layer Normalization. The choice of S/B does not matter for Layer Normalization since it processes samples independently.
  • ...and 8 more figures