Analyzing the Training Dynamics of Image Restoration Transformers: A Revisit to Layer Normalization

MinKyu Lee, Sangeek Hyun, Woojin Jun, Hyunjun Kim, Jiwoo Chung, Jae-Pil Heo

TL;DR

This work proposes Image Restoration Transformer Tailored Layer Normalization (i-LN), a simple drop-in replacement for LayerNorm that normalizes features holistically and adaptively rescales them per input. Theoretical insights and empirical evidence show that this design improves training dynamics and, in turn, restoration performance.

Abstract

This work analyzes the training dynamics of Image Restoration (IR) Transformers and uncovers a critical yet overlooked issue: conventional LayerNorm (LN) drives feature magnitudes to diverge to the scale of millions and collapses channel-wise entropy. We interpret this as the network attempting to bypass LN's constraints, which conflict with the requirements of IR tasks. Accordingly, we identify and address two misalignments between LN and IR: 1) per-token normalization disrupts spatial correlations, and 2) input-independent scaling discards input-specific statistics. To resolve them, we propose Image Restoration Transformer Tailored Layer Normalization (i-LN), a simple drop-in replacement that normalizes features holistically and adaptively rescales them per input. We provide theoretical insights and empirical evidence, validated by extensive experiments, that this simple design improves both training dynamics and performance.

Paper Structure

This paper contains 37 sections, 2 theorems, 17 equations, 29 figures, 15 tables, 1 algorithm.

Key Result

Proposition 1

(Vanilla LayerNorm fails to preserve structure.) Let $T_{\mathrm{LN}}$ be the normalization in vanilla per-token LN. Then, in general, there do not exist $a>0$ and an orthogonal $Q$ such that [...]. Thus $T_{\mathrm{LN}}$ is not even conformal on the token set. Since homotheties form a strict subclass of conformal maps, $T_{\mathrm{LN}}$ is not a homothety and therefore it does not preserve inter-pixel s...
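
As a small numeric illustration of the proposition, the sketch below compares a ratio of pairwise token distances before and after normalization. A map that scales and rotates every token in the same way leaves such ratios unchanged, so a changed ratio rules that form out. The function names, the toy tokens, and the single-mean, single-std form of the holistic normalization are assumptions made for illustration, and the learnable affine parameters of LN are omitted.

```python
import math

def per_token_ln(tokens, eps=1e-6):
    """Normalize each token over its own channels, as vanilla LN does (no affine)."""
    out = []
    for t in tokens:
        mu = sum(t) / len(t)
        var = sum((v - mu) ** 2 for v in t) / len(t)
        out.append([(v - mu) / math.sqrt(var + eps) for v in t])
    return out

def holistic_norm(tokens, eps=1e-6):
    """Normalize with one mean and one std over all tokens and channels (assumed form)."""
    flat = [v for t in tokens for v in t]
    mu = sum(flat) / len(flat)
    var = sum((v - mu) ** 2 for v in flat) / len(flat)
    s = math.sqrt(var + eps)
    return [[(v - mu) / s for v in t] for t in tokens]

def dist(a, b):
    """Euclidean distance between two tokens."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_ratio(tokens):
    """Ratio d(t0, t1) / d(t0, t2); uniform scaling plus rotation leaves it unchanged."""
    return dist(tokens[0], tokens[1]) / dist(tokens[0], tokens[2])

# Toy tokens (rows) with 4 channels each; the second token is 2x the first.
tokens = [[1.0, 2.0, 3.0, 4.0],
          [2.0, 4.0, 6.0, 8.0],
          [10.0, 0.0, -10.0, 5.0]]

ratio_in = distance_ratio(tokens)
ratio_ln = distance_ratio(per_token_ln(tokens))
ratio_holistic = distance_ratio(holistic_norm(tokens))
print(ratio_in, ratio_ln, ratio_holistic)
```

Because the first two tokens differ only by a factor of 2, per-token LN maps them to almost the same vector and the ratio collapses toward zero, while the holistic variant keeps the original ratio.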

Figures (29)

  • Figure 2: Feature magnitude evolution in IR Transformers across different settings. (a-b) Feature divergence becomes more pronounced as the network scales. (c) Feature divergence appears across various Transformer backbones and IR tasks, namely super-resolution (SR), denoising (DN), deraining (DR), and JPEG compression artifact removal (CAR), demonstrating that this phenomenon is widespread. It can be effectively mitigated by simply replacing conventional LayerNorm with the proposed $i\text{-LN}$.
  • Figure 3: Comparison between IR Transformer blocks using conventional per-token LayerNorm (LN) and our proposed $i\text{-LN}$. Contrary to conventional LN, which normalizes each token independently, our $i\text{-LN}$ applies holistic normalization across the entire spatio-channel dimension, preserving essential spatial correlations between tokens. Additionally, $i\text{-LN}$ input-adaptively rescales features after the attention (Attn) and feedforward (FFN) layers, thereby better preserving input statistics and allowing feature range flexibility. These together enable IR Transformers to preserve low-level characteristics of input throughout the network, aligning with the unique requirements of IR.
  • Figure 4: Feature divergence across various normalizations.
  • Figure 5: Eval-mode BN and removing all normalization (None) both fail.
  • Figure 6: Qualitative comparison across four representative image restoration tasks.
  • ...and 24 more figures
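
The normalize-then-rescale flow described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `holistic_stats`, `toy_sublayer`, and `iln_style_block` are hypothetical names, the sublayer is a stand-in for Attn or FFN, the residual connection is omitted, and rescaling by the saved standard deviation alone is an assumption, since the exact rescaling rule is not given in this summary.

```python
import math

def holistic_stats(x, eps=1e-6):
    """Return one mean and one std computed over all tokens and channels."""
    flat = [v for token in x for v in token]
    mu = sum(flat) / len(flat)
    var = sum((v - mu) ** 2 for v in flat) / len(flat)
    return mu, math.sqrt(var + eps)

def toy_sublayer(x):
    """Hypothetical stand-in for Attn or FFN: an elementwise nonlinearity."""
    return [[math.tanh(v) for v in token] for token in x]

def iln_style_block(x, sublayer=toy_sublayer):
    """Normalize holistically, apply the sublayer, then rescale per input.

    Assumption: rescaling multiplies by the std saved from this same input.
    """
    mu, sigma = holistic_stats(x)
    normed = [[(v - mu) / sigma for v in token] for token in x]
    out = sublayer(normed)
    return [[v * sigma for v in token] for token in out]

# Two inputs with identical structure but a 100x difference in scale.
x_small = [[1.0, 2.0, 3.0], [4.0, 6.0, 9.0]]
x_large = [[100.0 * v for v in token] for token in x_small]

out_small = iln_style_block(x_small)
out_large = iln_style_block(x_large)
print(out_small[0], out_large[0])
```

With this rescaling, an input that is 100 times larger yields an output that is about 100 times larger, whereas a normalization followed by a fixed, input-independent scale would map both inputs to the same output range.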

Theorems & Definitions (4)

  • Definition 1: Inter-pixel Structure and Preservation
  • Definition 2: Structure Preserving Transformation
  • Proposition 1
  • Proposition 2