UnitNorm: Rethinking Normalization for Transformers in Time Series

Nan Huang, Christian Kümmerle, Xiang Zhang

TL;DR

UnitNorm introduces an input-norm-based normalization $\mathrm{UN}(\boldsymbol{X}) = D^{k/2} \frac{\boldsymbol{X}}{\|\boldsymbol{X}\|_2}$ that omits centering to preserve dot-product signs in self-attention, addressing token shift, attention shift, and sparse attention in time-series Transformers. The framework situates UnitNorm as a variant of LayerNorm/RMSNorm, with a tunable hyperparameter $k$ that modulates attention sparsity via an entropy-lower-bound (ELB), enabling both dense and sparse attention patterns. Empirically, UnitNorm boosts performance across long-horizon forecasting, classification, and anomaly detection tasks, often outperforming BatchNorm, LayerNorm, and RMSNorm across multiple architectures. The work advocates reevaluating normalization strategies for TSA Transformers and presents UnitNorm as a practical, drop-in solution that improves stability and attention fidelity in complex sequential data domains.
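The normalization formula in the TL;DR can be illustrated with a minimal NumPy sketch. It assumes the norm is taken per token vector over the feature dimension $D$ (the last axis of an $N \times L \times D$ input); the function name `unit_norm`, the `eps` guard against zero-norm tokens, and the random input are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def unit_norm(x, k=1.5, eps=1e-12):
    """Rescale each token vector (last axis) to L2 norm D**(k/2), with no centering.

    Assumption: the norm is computed per token over the feature axis.
    `eps` is a hypothetical safeguard against all-zero tokens.
    """
    d = x.shape[-1]
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return d ** (k / 2) * x / np.maximum(norms, eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=(4, 16, 8))  # (N, L, D)
y = unit_norm(x, k=1.5)

# Every output token has the same norm, D**(k/2).
out_norms = np.linalg.norm(y, axis=-1)

# Pairwise dot products before and after: rescaling by positive factors
# cannot change their signs.
dots_before = x @ x.transpose(0, 2, 1)
dots_after = y @ y.transpose(0, 2, 1)
```

Because each token is only multiplied by a positive scalar, the sign of every pairwise dot product is unchanged, which is the property the TL;DR attributes to omitting the centering step.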

Abstract

Normalization techniques are crucial for enhancing Transformer models' performance and stability in time series analysis tasks, yet traditional methods like batch and layer normalization often lead to issues such as token shift, attention shift, and sparse attention. We propose UnitNorm, a novel approach that scales input vectors by their norms and modulates attention patterns, effectively circumventing these challenges. Grounded in existing normalization frameworks, UnitNorm's effectiveness is demonstrated across diverse time series analysis tasks, including forecasting, classification, and anomaly detection, via a rigorous evaluation on 6 state-of-the-art models and 10 datasets. Notably, UnitNorm shows superior performance, especially in scenarios requiring robust attention mechanisms and contextual comprehension, with improvements of up to a 1.46 decrease in MSE for forecasting and a 4.89% increase in accuracy for classification. This work not only calls for a reevaluation of normalization strategies in time series Transformers but also sets a new direction for enhancing model performance and stability. The source code is available at https://anonymous.4open.science/r/UnitNorm-5B84.

Paper Structure (24 sections, 7 theorems, 51 equations, 14 figures, 11 tables)

Table of Contents

Introduction
Challenges in Normalization
Token shift
Attention shift
Sparse attention
Methodology
Theoretical foundation
Overcoming defects
Experiments
Discussion
Related work
Adopting UnitNorm in Transformer models
Limitations
Conclusion
Dimension Dependence of Sign-Flip Probability
...and 9 more sections

Key Result

Theorem 2.1

Assume that $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_{x}, \operatorname{diag}(\boldsymbol{\sigma}_x^2))$ and $\mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_{y}, \operatorname{diag}(\boldsymbol{\sigma}_y^2))$ are two independent token vectors, with [...] then the probability that the signs of $\mathbf{x}^{\top} \mathbf{y}$ and $\tilde{\mathbf{x}}^{\top}$ [...]
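The sign-flip phenomenon this theorem concerns can be probed numerically. The sketch below is a Monte Carlo illustration under simplified assumptions (both tokens share one scalar mean `mu` and unit variance in every coordinate); it estimates how often centering changes the sign of a dot product and does not reproduce the theorem's bound. All function names and parameter values are hypothetical.

```python
import numpy as np

def center(v):
    # Subtract the per-token mean, as the centering step of LayerNorm does.
    return v - v.mean(axis=-1, keepdims=True)

def scale_only(v):
    # Divide by the per-token L2 norm, with no centering.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def sign_flip_rate(transform, d=64, mu=1.0, sigma=1.0, trials=20000, seed=0):
    """Fraction of independent token pairs whose dot-product sign changes
    after applying `transform` to both tokens."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=(trials, d))
    y = rng.normal(mu, sigma, size=(trials, d))
    raw = np.sum(x * y, axis=-1)
    transformed = np.sum(transform(x) * transform(y), axis=-1)
    return float(np.mean(np.sign(raw) != np.sign(transformed)))

rate_centered = sign_flip_rate(center)
rate_scaled = sign_flip_rate(scale_only)
print(f"centering: {rate_centered:.3f}, scaling only: {rate_scaled:.3f}")
```

With a nonzero shared mean, the raw dot products are almost always positive while the centered ones are symmetric around zero, so roughly half of the pairs flip sign under centering and none flip under pure scaling.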

Figures (14)

Figure 1: Scheme of different normalization methods. The input to the normalization layers is a batch of token-vector sequences $\mathbf{X} \in \mathbb{R}^{N\times L \times D}$, where $N$ is the batch size, $L$ is the sequence length and $D$ is the dimension of each token vector. The blue sections show a single slice of the input tensor used to compute the mean $\mu$ and variance $\sigma^2$, while the red section shows a single slice of data used to compute the vector norm $\left\|\mathbf{x}\right\|$ (see \ref{sec:norm-difference}).
Figure 2: Case of token shift in LayerNorm. The green cross denotes a query vector; the red and blue circles denote two key vectors. Token shift happens at the centering step of normalization and causes a sign flip in the dot product, whereas the scaling step has no such effect.
Figure 3: Empirical statistics for attention scores after each normalization method. Results from 10 independent experiments are overlaid. $k=1.5$ is used for UnitNorm.
Figure 4: Landscape of $k_{50}$ for different $L, D$. The $k_{50}$ is the value of $k$ that achieves an ELB of half of the theoretical maximum $\log{L}$ for a given $L, D$ pair. The landscape of $k_{50}$ is rather smooth and insensitive to the sequence length $L$, indicating UnitNorm with fixed $k$ can be applied to sequences with variable length without significant change in the attention pattern.
Figure 5: Average rank of normalization methods on the long-term forecasting tasks. X-axis: number of tokens to forecast, Y-axis: average rank over models. Ranks are computed based on the MAE or MSE of each model on each task with different normalization methods (lower is better). UnitNorm and UnitNorm (learnable) achieve better ranks as the prediction horizon increases, and their prediction error grows more slowly than that of the other normalization methods.
...and 9 more figures
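
The role of $k$ described in Figures 3 and 4 can be explored with a small experiment. The sketch below measures the empirical mean entropy of softmax attention rows for several values of $k$; it assumes identity query/key projections and Gaussian random tokens, and it reports measured entropy rather than the paper's ELB. The function names and the chosen $L$, $D$, and $k$ values are illustrative.

```python
import numpy as np

def unit_norm(x, k):
    # Rescale each token (last axis) to L2 norm D**(k/2).
    d = x.shape[-1]
    return d ** (k / 2) * x / np.linalg.norm(x, axis=-1, keepdims=True)

def mean_attention_entropy(k, L=64, D=32, seed=0):
    """Mean row entropy of softmax(Q K^T / sqrt(D)) with Q = K = UnitNorm(tokens).

    Assumption: queries and keys are the normalized tokens themselves
    (no learned projections).
    """
    rng = np.random.default_rng(seed)
    tokens = unit_norm(rng.normal(size=(L, D)), k)
    scores = tokens @ tokens.T / np.sqrt(D)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    entropy = -np.sum(probs * np.log(probs + 1e-300), axis=-1)
    return float(entropy.mean())

max_entropy = np.log(64)  # entropy of a uniform distribution over L = 64 keys
entropies = {k: mean_attention_entropy(k) for k in (0.0, 0.5, 1.0, 1.5)}
for k, h in entropies.items():
    print(f"k={k}: mean entropy {h:.4f} (max {max_entropy:.4f})")
```

Since changing $k$ only rescales the attention logits for a fixed set of tokens, larger $k$ acts like a lower softmax temperature: attention moves from near-uniform (dense) toward near one-hot (sparse).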

Theorems & Definitions (13)

Theorem 2.1: High probability of sign flip due to center operation
Remark 2.2
Theorem 3.1: UnitNorm preserves the gradient to the input and stabilizes the gradient to the learnable parameters
Theorem 3.2: UnitNorm guarantees an entropy lower bound independent of the input
Corollary 3.3: The ELB of UnitNorm can be any possible value by modulating $k$
Corollary A.1
Proof of \ref{cor:LayerNorm}
Proof of \ref{thm:dot-product-sign-flip}
Lemma B.1: Bernstein's Inequality, cf. Lemma 5.1 of \cite{dirksen_tail_2015}
Lemma B.2: Bounds on $\psi_2$-norm of Gaussians \cite{vershynin_high-dimensional_2018}
...and 3 more