HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization

Zhijian Zhuo; Yutao Zeng; Ya Wang; Sijun Zhang; Jian Yang; Xiaoqing Li; Xun Zhou; Jinwen Ma

TL;DR

HybridNorm addresses the stability-performance trade-off of normalization in deep transformers by applying QKV-Norm within attention and Post-Norm in FFN, forming an intra-layer hybrid scheme. Theoretical and empirical results show improved gradient flow and robustness, with extensive experiments across dense and MoE models demonstrating superior training stability and downstream performance versus Pre-Norm, Post-Norm, and Mix-LN. Specialized handling of the first block and scaling-law analyses further validate its effectiveness for large-scale model training. The work provides practical guidance and code to adopt HybridNorm in future transformer architectures, with implications for more robust and scalable LLM training.

Abstract

Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, many challenges remain in training deep transformer networks, especially regarding the position of the layer normalization. While Pre-Norm structures facilitate more stable training owing to their stronger identity path, they often lead to suboptimal performance compared to Post-Norm. In this paper, we propose HybridNorm, a simple yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. We provide both theoretical insights and empirical evidence to demonstrate that HybridNorm improves the gradient flow and the model robustness. Extensive experiments on large-scale transformer models, including both dense and sparse variants, show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches across multiple benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at https://github.com/BryceZhuo/HybridNorm.
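
To make the intra-layer scheme concrete, the following is a minimal sketch of one block with QKV normalization inside attention and Post-Norm around the FFN. It is an illustration under stated assumptions, not the authors' implementation: it uses a single head, no attention mask, RMS normalization without learned gains, a ReLU FFN, and the conventional Post-Norm placement Norm(y + FFN(y)). All function and parameter names (rms_norm, qkv_norm_attention, hybrid_block, init_params) are hypothetical, and the exact placement in the released code may differ.

```python
import math
import random


def rms_norm(x, eps=1e-6):
    """RMS-normalize one vector (assumption: no learned gain, for brevity)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def qkv_norm_attention(xs, wq, wk, wv, wo):
    """Single-head, unmasked self-attention with Q, K and V each normalized."""
    q = [rms_norm(matvec(wq, x)) for x in xs]
    k = [rms_norm(matvec(wk, x)) for x in xs]
    v = [rms_norm(matvec(wv, x)) for x in xs]
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        mixed = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))]
        out.append(matvec(wo, mixed))
    return out


def ffn(x, w1, w2):
    """Two-layer feed-forward network (assumption: ReLU activation)."""
    hidden = [max(0.0, h) for h in matvec(w1, x)]
    return matvec(w2, hidden)


def hybrid_block(xs, params):
    """One block: QKV-Norm attention sublayer, then a Post-Norm FFN sublayer."""
    attn = qkv_norm_attention(xs, params["wq"], params["wk"], params["wv"], params["wo"])
    # Attention sublayer: plain residual add; the only norms are on Q, K and V.
    ys = [[a + b for a, b in zip(x, o)] for x, o in zip(xs, attn)]
    out = []
    for y in ys:
        f = ffn(y, params["w1"], params["w2"])
        # FFN sublayer (assumed conventional Post-Norm): normalize after the residual add.
        out.append(rms_norm([a + b for a, b in zip(y, f)]))
    return out


def random_matrix(rows, cols, rng):
    """Small Gaussian-initialized matrix as a list of rows."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_params(d_model, d_ff, seed=0):
    """Hypothetical parameter container for the sketch above."""
    rng = random.Random(seed)
    return {
        "wq": random_matrix(d_model, d_model, rng),
        "wk": random_matrix(d_model, d_model, rng),
        "wv": random_matrix(d_model, d_model, rng),
        "wo": random_matrix(d_model, d_model, rng),
        "w1": random_matrix(d_ff, d_model, rng),
        "w2": random_matrix(d_model, d_ff, rng),
    }
```

Calling hybrid_block(xs, init_params(8, 16)) on a list of 8-dimensional token vectors returns one vector per token. Because the last operation is a normalization, each output vector has a root mean square close to 1, whereas the attention sublayer leaves the residual stream itself unnormalized.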
