Unified Batch Normalization: Identifying and Alleviating the Feature Condensation in Batch Normalization and a Unified Framework

Shaobo Wang; Xiangdong Zhang; Dongrui Liu; Junchi Yan

TL;DR

This work identifies a feature condensation phenomenon in Batch Normalization (BN), in which the cosine similarity between features increases during training and hinders learning. It proposes Unified Batch Normalization (UBN), a two-stage framework: first, a Feature Condensation Threshold (FCT) governs the use of running statistics to reduce condensation; second, a unified set of rectifications across the BN components (normalization, affine transformation, and running statistics) improves representation learning. On ImageNet, CIFAR-10/100, and COCO, UBN improves accuracy and mean average precision (by about 3% on ImageNet classification and about 4% on COCO object detection and instance segmentation) and converges faster in the early stages of training. UBN is designed as a drop-in replacement for BN in existing architectures, and the gains hold across vision tasks and backbone variants.

Abstract

Batch Normalization (BN) has become an essential technique in contemporary neural network design, enhancing training stability. Specifically, BN employs centering and scaling operations to standardize features along the batch dimension and uses an affine transformation to recover features. Although standard BN has shown its capability to improve deep neural network training and convergence, it still exhibits inherent limitations in certain cases. Current enhancements to BN typically address only isolated aspects of its mechanism. In this work, we critically examine BN from a feature perspective, identifying feature condensation during BN as a factor that is detrimental to test performance. To tackle this problem, we propose a two-stage unified framework called Unified Batch Normalization (UBN). In the first stage, we employ a straightforward feature condensation threshold to mitigate condensation effects, thereby preventing improper updates of statistical norms. In the second stage, we unify various normalization variants to boost each component of BN. Our experimental results show that UBN significantly enhances performance across different visual backbones and different vision tasks, and expedites network training convergence, particularly in early training stages. Notably, our method improves accuracy by about 3% on ImageNet classification and mean average precision by about 4% on both object detection and instance segmentation on the COCO dataset, showing the effectiveness of our approach in real-world scenarios.
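
The three BN components named in the abstract (centering, scaling, affine transformation) can be illustrated with a minimal sketch of standard BN at training time. It uses plain Python lists for readability; the function name `batch_norm`, the `eps` default, and the toy input are illustrative assumptions, not taken from the paper, and running statistics are omitted.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Standard BN over a batch of feature vectors (a list of equal-length lists).

    Illustrative sketch only: training-time statistics, no running averages.
    """
    n = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        column = [row[d] for row in batch]
        mean = sum(column) / n                           # centering statistic
        var = sum((v - mean) ** 2 for v in column) / n   # scaling statistic (biased variance)
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            x_hat = (batch[i][d] - mean) * inv_std       # standardize along the batch dimension
            out[i][d] = gamma[d] * x_hat + beta[d]       # affine transformation
    return out

# Toy batch of two samples with two features each (hypothetical values).
batch = [[1.0, 10.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` set to ones and `beta` to zeros, each feature column of `normalized` has approximately zero mean and unit variance, regardless of the original scale of that column.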

Paper Structure (34 sections, 5 equations, 40 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Normalization Methods
Component-Specific Approaches to Boost BN
Running Statistics in BN across iterations
Feature Learning
Method
Revisiting Batch Normalization
Identifying the Feature Condensation Phenomenon in Training with BN
Alleviating Feature Condensation with Rectifications
Feature Condensation Rectification
Normalization Rectification
Affine Rectification
Experiments
Implementation Details
...and 19 more sections

Figures (40)

Figure 1: The feature condensation phenomenon and learning dynamics of normalization methods on the CIFAR-10 dataset. (a) The average cosine similarity between input features of the first normalization layer in ResNet-34 (He et al., 2015). UBN reduces the feature condensation phenomenon by properly leveraging the running statistics, improving normalization performance. (b) The training loss of ResNet-34 with different representative normalization methods. (c) The testing accuracy of ResNet-34 with different representative normalization methods. UBN shows faster convergence and better performance than BN and other state-of-the-art (SOTA) normalization methods. We used a non-uniform axis scale to visualize the differences between UBN and other SOTA normalization methods in terms of loss and accuracy.
Figure 2: The feature condensation heatmap of ResNet-50 on the CIFAR-100 dataset. We randomly sampled 32 images as input and calculated the average cosine similarity of the input features of the first normalization layer of ResNet-50 before, during, and at the end of training. (a) With BN, feature condensation became significant during training. (b) With UBN, feature condensation was dramatically alleviated compared to BN. Note that each model's random weight initialization determines the features in the first epoch.
Figure 3: Sketch of the standard BN and our UBN. (a) BN consists of three components, i.e., centering, scaling, and affine transformation. (b) UBN improves BN with a two-stage method. In Stage 1, UBN defines the statistics for computing normalization by comparing the condensation score $S$ of the input features $\mathbf{X}$ with a given feature condensation threshold $\tau$. In Stage 2, UBN performs rectifications on each component of BN with the given statistics determined in Stage 1.
Figure 4: The learning curves and feature condensation curves of UBN and other normalization methods with ResNet-50 on the CIFAR-100 dataset. UBN can significantly reduce the feature condensation phenomenon without abandoning the batch statistics, as GN (Wu and He, 2018), LN (Ba et al., 2016), and IN (Ulyanov et al., 2017) do. Using the feature condensation threshold improves the stability of training, and unifying the rectifications of each BN component improves the performance of UBN over existing normalization methods. We used a non-uniform axis scale to visualize the differences between UBN and other SOTA normalization methods in both loss and accuracy.
Figure 5: Effectiveness ablation (testing accuracies) on ResNet-50 and ResNet-101 trained on the CIFAR-10 dataset with batch size 128 for 200 epochs. We plot the accuracies at the end of epochs 0, 50, and 200. Each rectification boosts the testing performance of BN.
...and 35 more figures
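
The captions of Figures 2 and 3 describe two computations: an average cosine similarity over input features, and a Stage 1 comparison of a condensation score $S$ against a threshold $\tau$. The sketch below is one possible reading of those captions, not the authors' implementation. The function names are hypothetical, features are assumed to be flattened, non-zero vectors, and the direction of the rule (fall back to running statistics when $S > \tau$) is an assumption, since the captions do not state which side of the threshold selects which statistics.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def condensation_score(features):
    """Mean cosine similarity over all distinct pairs of feature vectors."""
    n = len(features)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum(cosine_similarity(features[i], features[j]) for i, j in pairs)
    return total / len(pairs)

def select_statistics(features, running_mean, running_var, tau):
    """Hypothetical Stage 1 rule: choose the statistics used for normalization.

    ASSUMPTION: a score above tau means the batch is condensed, so the
    running statistics are used; otherwise batch statistics are computed.
    """
    score = condensation_score(features)
    if score > tau:
        return running_mean, running_var, "running"
    n = len(features)
    dims = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dims)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dims)]
    return mean, var, "batch"

# Toy inputs (hypothetical): one fully aligned batch, one orthogonal batch.
condensed = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0]]
condensed_choice = select_statistics(condensed, [0.0, 0.0], [1.0, 1.0], tau=0.5)
spread_choice = select_statistics(spread, [0.0, 0.0], [1.0, 1.0], tau=0.5)
```

In this toy setup the aligned batch has a pairwise similarity of 1 and selects the running statistics, while the orthogonal batch has a similarity of 0 and selects its own batch statistics. The threshold value 0.5 is arbitrary and chosen only for the example.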

Theorems & Definitions (1)

Definition: Feature Condensation
