IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning
Xiandong Zou, Pan Zhou
TL;DR
This paper reframes normalization through the Information Bottleneck (IB) lens and introduces IBNorm, a family of normalization methods that includes a bounded compression step to filter out nuisance information while preserving task-relevant predictive information. By decomposing normalization into grouping (NAP), normalization (NOP), and recovery (NRR), and by optimizing a multi-layer IB objective, IBNorm aims to maximize $I(Y;T_l)$ while minimizing $I(T_{l-1};T_l)$. The authors provide theoretical guarantees that IBNorm achieves higher IB values and tighter generalization bounds than variance-centric methods, and they support these claims with experiments across language and vision models, where IBNorm variants consistently outperform BN, LN, RMSNorm, and NormalNorm. The results show practical improvements in both large language models and vision transformers, supporting the potential of information-theoretic normalization for generalization and robustness. Ablations on the compression strength and the order of operations confirm that the compression-then-standardization design and the affine reparameterization are both needed for the best performance.
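The three-stage decomposition and the compression-then-standardization order can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the summary does not specify the compression function, so the `tanh` squashing, the per-vector (LayerNorm-style) grouping, the function name `ibnorm_sketch`, and the strength parameter `lam` are all assumptions chosen only to make the pipeline concrete.

```python
import numpy as np


def ibnorm_sketch(x, gamma, beta, lam=1.0, eps=1e-5):
    """Hypothetical three-stage normalization over the last axis.

    x: array of shape (..., d); gamma, beta: arrays of shape (d,).
    lam: assumed compression strength, must be > 0 (larger squashes more).
    """
    # Grouping (NAP): treat each feature vector along the last axis as one
    # group, as LayerNorm does. The grouping choice is an assumption.
    mu = x.mean(axis=-1, keepdims=True)
    centered = x - mu
    scale = centered.std(axis=-1, keepdims=True) + eps

    # Normalization (NOP), step 1: bounded compression.
    # tanh is a stand-in for the paper's compression operation; it limits
    # every centered value to the open interval (-scale/lam, scale/lam).
    compressed = (scale / lam) * np.tanh(lam * centered / scale)

    # Normalization (NOP), step 2: standardize the compressed values.
    compressed = compressed - compressed.mean(axis=-1, keepdims=True)
    var = compressed.var(axis=-1, keepdims=True)
    normed = compressed / np.sqrt(var + eps)

    # Recovery (NRR): learnable affine reparameterization.
    return gamma * normed + beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    out = ibnorm_sketch(x, gamma=np.ones(8), beta=np.zeros(8), lam=1.0)
    print(out.mean(axis=-1), out.var(axis=-1))
```

With `gamma = 1` and `beta = 0`, each output vector has approximately zero mean and unit variance, so the sketch stays a drop-in replacement for a standard normalization layer. As `lam` approaches zero, `tanh` becomes nearly linear and the sketch reduces to plain standardization, which makes `lam` a convenient knob for the kind of compression-strength ablation the summary mentions.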
Abstract
Normalization is fundamental to deep learning, but existing approaches such as BatchNorm, LayerNorm, and RMSNorm are variance-centric: they enforce zero mean and unit variance, which stabilizes training but does not control how representations capture task-relevant information. We propose IB-Inspired Normalization (IBNorm), a simple yet powerful family of methods grounded in the Information Bottleneck principle. IBNorm introduces bounded compression operations that encourage embeddings to preserve predictive information while suppressing nuisance variability, yielding more informative representations and retaining the stability and compatibility of standard normalization. Theoretically, we prove that IBNorm achieves a higher IB value and tighter generalization bounds than variance-centric methods. Empirically, IBNorm consistently outperforms BatchNorm, LayerNorm, and RMSNorm across large-scale language models (LLaMA, GPT-2) and vision models (ResNet, ViT), with mutual information analysis confirming superior information bottleneck behavior. Code will be released publicly.
