IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning

Xiandong Zou, Pan Zhou

TL;DR

This paper reframes normalization through the Information Bottleneck (IB) lens and introduces IBNorm, a family of normalization methods that include a bounded compression step to filter out nuisance information while preserving task-relevant predictive information. By decomposing normalization into grouping (NAP), normalization (NOP), and recovery (NRR), and by optimizing a multi-layer IB objective, IBNorm aims to maximize $I(Y;T_l)$ while minimizing $I(T_{l-1};T_l)$. The authors provide theoretical guarantees showing IBNorm achieves higher IB values and tighter generalization bounds than variance-centric methods, and they substantiate these claims with extensive experiments across language and vision models, where IBNorm variants consistently outperform BN, LN, RMSNorm, and NormalNorm. The results demonstrate practical improvements in both large language models and vision transformers, supporting the potential of information-theoretic normalization for generalization and robustness. The work also provides ablations on the compression strength and operation order, confirming the importance of the compression-then-standardization design and affine reparameterization for best performance.
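
As a rough guide to the trade-off named above, one conventional way to write a layer-wise IB objective is the Lagrangian form below. The trade-off weight $\beta$ and the sum over layers $l = 1, \dots, L$ are assumptions made here for illustration; the paper's exact multi-layer objective may differ.

$$
\max \; \sum_{l=1}^{L} \Big[ I(Y; T_l) - \beta \, I(T_{l-1}; T_l) \Big]
$$

Here $T_l$ is the representation at layer $l$ and $Y$ is the target, so the first term rewards predictive information and the second penalizes information carried over from the previous layer.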

Abstract

Normalization is fundamental to deep learning, but existing approaches such as BatchNorm, LayerNorm, and RMSNorm are variance-centric by enforcing zero mean and unit variance, stabilizing training without controlling how representations capture task-relevant information. We propose IB-Inspired Normalization (IBNorm), a simple yet powerful family of methods grounded in the Information Bottleneck principle. IBNorm introduces bounded compression operations that encourage embeddings to preserve predictive information while suppressing nuisance variability, yielding more informative representations while retaining the stability and compatibility of standard normalization. Theoretically, we prove that IBNorm achieves a higher IB value and tighter generalization bounds than variance-centric methods. Empirically, IBNorm consistently outperforms BatchNorm, LayerNorm, and RMSNorm across large-scale language models (LLaMA, GPT-2) and vision models (ResNet, ViT), with mutual information analysis confirming superior information bottleneck behavior. Code will be released publicly.
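
The compression-then-standardization design with an affine recovery step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the tanh-based `bounded_compress` function, the per-row grouping, and the names `lam`, `scale`, and `shift` are all assumptions chosen for illustration, and the paper's IBNorm-L, IBNorm-T, and IBNorm-S operations may be defined differently.

```python
import numpy as np


def bounded_compress(x, lam=4.0):
    """Hypothetical bounded compression (an assumption, not the paper's
    definition): smoothly squashes values into the interval (-lam, lam)."""
    return lam * np.tanh(x / lam)


def compress_then_standardize(x, lam=4.0, scale=1.0, shift=0.0, eps=1e-5):
    """Sketch of a compress-then-standardize layer over the last axis.

    Grouping:  statistics are taken per row (last axis), as in LayerNorm.
    Operation: center, apply the bounded compression, then standardize.
    Recovery:  affine map with `scale` and `shift`.
    """
    x = np.asarray(x, dtype=np.float64)
    centered = x - x.mean(axis=-1, keepdims=True)
    compressed = bounded_compress(centered, lam)
    mean = compressed.mean(axis=-1, keepdims=True)
    var = compressed.var(axis=-1, keepdims=True)
    standardized = (compressed - mean) / np.sqrt(var + eps)
    return scale * standardized + shift


# Usage: a single row containing one large outlier.
row = np.concatenate([np.linspace(-1.0, 1.0, 63), [100.0]])
out = compress_then_standardize(row[None, :], lam=4.0)
```

In this sketch the output still has approximately zero mean and unit variance per row, but the outlier is squashed before the statistics are computed, so it ends up with a smaller standardized magnitude than it would under plain standardization.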

Paper Structure

This paper contains 36 sections, 14 theorems, 97 equations, 8 figures, 10 tables, 2 algorithms.

Key Result

Theorem 1

[IB Value] For any hyperparameter $\beta \in [0,1]$ and the sample dataset $S \sim \mathbb{P}(X,Y)$ of size $M$, we have

Figures (8)

  • Figure 1: Comparison of kernel density estimation for Gaussian inputs (mean 0, varying variance) under different compression operations: Standardization, IBNorm-L, IBNorm-T, and IBNorm-S ($\lambda=4$). Compression operations in IBNorm compress the tails of the activations while adjusting higher-order statistics. IBNorm-L, IBNorm-T, and IBNorm-S, with the same $\lambda$, differ in how strongly they compress the tails. See more examples in the appendix.
  • Figure 2: Evaluation of different normalization methods on Llama-130M and GPT-2 small. (a) shows token-level IB values evaluated on the test dataset when training only the normalization layers, and (b) reports test performance on the LLM Leaderboard II.
  • Figure 3: Comparison of kernel density estimation across different compression operations (Standardization, IBNorm-L ($\lambda=4$), IBNorm-T ($\lambda=4$), IBNorm-S ($\lambda=3$)) given Exponential distribution inputs with mean 0 and different lambdas. Compression operations in IBNorm compress the tails of the activations and adjust the higher-order statistics.
  • Figure 4: Comparison of kernel density estimation across different compression operations (Standardization, IBNorm-L ($\lambda=4$), IBNorm-T ($\lambda=4$), IBNorm-S ($\lambda=3$)) given Laplace distribution inputs with mean 0 and different scales. Compression operations in IBNorm compress the tails of the activations and adjust the higher-order statistics.
  • Figure 5: Comparison of kernel density estimation across different compression operations (Standardization, IBNorm-L ($\lambda=4$), IBNorm-T ($\lambda=4$), IBNorm-S ($\lambda=3$)) given Gaussian distribution inputs with mean 0 and different variances. Compression operations in IBNorm compress the tails of the activations and adjust the higher-order statistics.
  • ...and 3 more figures

Theorems & Definitions (19)

  • Theorem 1
  • Corollary 2: Generalization Bound
  • Theorem
  • proof
  • Lemma 3
  • Proposition 4: Entropy Reduction
  • Lemma 5
  • Lemma
  • proof
  • Lemma 6
  • ...and 9 more