HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization

Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma

TL;DR

HybridNorm addresses the stability-performance trade-off of normalization in deep transformers by applying QKV-Norm within attention and Post-Norm in FFN, forming an intra-layer hybrid scheme. Theoretical and empirical results show improved gradient flow and robustness, with extensive experiments across dense and MoE models demonstrating superior training stability and downstream performance versus Pre-Norm, Post-Norm, and Mix-LN. Specialized handling of the first block and scaling-law analyses further validate its effectiveness for large-scale model training. The work provides practical guidance and code to adopt HybridNorm in future transformer architectures, with implications for more robust and scalable LLM training.

Abstract

Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, many challenges remain in training deep transformer networks, especially regarding the position of the layer normalization. While Pre-Norm structures facilitate more stable training owing to their stronger identity path, they often lead to suboptimal performance compared to Post-Norm. In this paper, we propose $\textbf{HybridNorm}$, a simple yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. We provide both theoretical insights and empirical evidence to demonstrate that HybridNorm improves the gradient flow and the model robustness. Extensive experiments on large-scale transformer models, including both dense and sparse variants, show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches across multiple benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at https://github.com/BryceZhuo/HybridNorm.
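To make the block layout described in the abstract concrete, here is a minimal NumPy sketch of one HybridNorm-style block: queries, keys, and values are each normalized inside attention, and the FFN sublayer is wrapped in Post-Norm. Several details are assumptions made for illustration and are not taken from the paper: a single attention head with no causal mask, an RMS-style normalization without learnable gain, a ReLU FFN, and the standard Post-Norm form `Norm(y + FFN(y))`. The function and parameter names are hypothetical; the authors' repository is the reference for the exact formulation.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Assumed normalization: RMS over the feature axis, no learnable gain.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def qkv_norm_attention(x, W_q, W_k, W_v, W_o):
    # QKV-Norm: normalize queries, keys, and values after projection.
    q = rms_norm(x @ W_q)
    k = rms_norm(x @ W_k)
    v = rms_norm(x @ W_v)
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))
    return weights @ v @ W_o

def ffn(y, W_1, W_2):
    # Two-layer feed-forward network with ReLU (an assumed activation).
    return np.maximum(y @ W_1, 0.0) @ W_2

def hybridnorm_block(x, attn_params, ffn_params):
    # Attention sublayer: residual connection, normalization only on Q, K, V.
    y = x + qkv_norm_attention(x, *attn_params)
    # FFN sublayer: Post-Norm, i.e. normalize after the residual addition.
    return rms_norm(y + ffn(y, *ffn_params))

# Toy shapes: s tokens, model width d, head width d_k, FFN width d_ff.
rng = np.random.default_rng(0)
s, d, d_k, d_ff = 4, 8, 8, 16
attn_params = [0.1 * rng.normal(size=(d, d_k)) for _ in range(3)]
attn_params.append(0.1 * rng.normal(size=(d_k, d)))
ffn_params = [0.1 * rng.normal(size=(d, d_ff)), 0.1 * rng.normal(size=(d_ff, d))]

x = rng.normal(size=(s, d))
out = hybridnorm_block(x, attn_params, ffn_params)
```

Because the block ends with a normalization, every output row has a root-mean-square close to 1 regardless of the scale of `x`. The attention sublayer, by contrast, keeps an unnormalized residual path from `x` to `y`.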

Paper Structure

This paper contains 60 sections, 5 theorems, 68 equations, 12 figures, 13 tables, 1 algorithm.

Key Result

Theorem 1

Suppose the output of the attention is $S$, the input is $X\in\mathbb{R}^{s\times d}$, and the parameters are $W_Q,W_K,W_V,W_O^\top\in\mathbb{R}^{d\times d_k}$. The theorem then gives a separate statement for each of three cases: attention with Pre-Norm, attention with Pre-Norm and QK-Norm, and attention with QKV-Norm.

Figures (12)

  • Figure 1: Illustrations of different transformer layer structures: (a) Post-Norm architecture; (b) Pre-Norm architecture; (c) Pre-Norm with QK-Norm architecture; (d) HybridNorm architecture.
  • Figure 2: Layer gradient norm at step 1.
  • Figure 3: Layer gradient norm at step 100.
  • Figure 5: Training dynamics for 1.2B dense models with Pre-Norm, HybridNorm and HybridNorm$^*$ under 1T training tokens. We present the training loss, validation loss, and downstream performance on HellaSwag and ARC-Easy, demonstrating that HybridNorm$^*$ achieves superior performance.
  • Figure 6: Training dynamics for MoE-1B-7B models with Pre-Norm and HybridNorm$^*$ under 500B training tokens. We present the training loss, validation loss, and downstream performance on HellaSwag and MMLU Var, demonstrating that HybridNorm$^*$ achieves superior performance.
  • ...and 7 more figures
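The difference between the Post-Norm and Pre-Norm layouts in Figure 1 (a) and (b) can be sketched with a generic sublayer. The snippet below is illustrative only: `layer_norm` omits learnable scale and bias, and `post_norm`, `pre_norm`, and the zero sublayer are hypothetical helpers rather than code from the paper. It shows why Pre-Norm is said to have a stronger identity path: when the sublayer contributes nothing, Pre-Norm returns its input unchanged, while Post-Norm still rescales it.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Plain layer normalization over the feature axis (no learnable parameters).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm(x, sublayer):
    # Post-Norm: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm(x, sublayer):
    # Pre-Norm: normalize only the sublayer input; the residual path is untouched.
    return x + sublayer(layer_norm(x))

def zero_sublayer(h):
    # A sublayer that contributes nothing, used to isolate the residual path.
    return np.zeros_like(h)

x = np.array([[1.0, 2.0, 3.0, 6.0]])
pre_out = pre_norm(x, zero_sublayer)    # identical to x
post_out = post_norm(x, zero_sublayer)  # x rescaled to zero mean, unit variance
```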

Theorems & Definitions (9)

  • Remark 1
  • Theorem 1: Informal version of Theorem \ref{thm:gradient}
  • Lemma 1: Extension of Lemma 2 in \cite{noci2022signal}
  • Lemma 2
  • Lemma 3
  • Theorem 2
  • Proof of Lemma \ref{lem:gradient of QKV-Norm}
  • Proof of Lemma \ref{lem:gradient of Pre-Norm and QK-Norm}
  • Proof of Theorem \ref{thm:gradient}