SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

Tianyu Li; Dongchen Han; Zixuan Cao; Haofeng Huang; Mengyu Zhou; Ming Chen; Erchao Zhao; Xiaoxi Jiang; Guanjun Jiang; Gao Huang

TL;DR

SiameseNorm tackles the long-standing stability–expressivity trade-off between Pre-Norm and Post-Norm Transformers by introducing a two-stream residual architecture with shared blocks. One stream preserves the identity-gradient dynamics (Pre-Norm-like) while the other enforces bounded representations (Post-Norm-like). The two streams are fused before every computation, so the model gains stability and depth-enabled expressivity at the same time. Gradient analysis shows how the streams jointly preserve robust gradient flow while controlling representation scale. Experiments on 1.3B-parameter models show gains in perplexity and arithmetic reasoning under aggressive learning rates, outperforming the compared baselines. The approach generalizes existing normalization schemes and adds negligible overhead, offering a foundation for future multi-stream residual designs; limitations such as task-specific gains and emergent massive activations remain to be explored.

Abstract

Modern Transformers predominantly adopt the Pre-Norm paradigm for its optimization stability, forgoing the superior potential of the unstable Post-Norm architecture. Prior attempts to combine their strengths typically lead to a stability-performance trade-off. We attribute this phenomenon to a structural incompatibility within a single-stream design: any application of the Post-Norm operation inevitably obstructs the clean identity gradient preserved by Pre-Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters. This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre-Norm and Post-Norm by enabling all residual blocks to receive combined gradients inherited from both paradigms, where one stream secures stability while the other enhances expressivity. Extensive pre-training experiments on 1.3B-parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
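The two-stream idea described above can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions, not the authors' implementation: `layer_norm`, `block`, and `siamese_forward` are hypothetical names, the shared block is a stand-in nonlinearity instead of attention or an MLP, and the fusion rule (normalized Pre-Norm-like stream plus Post-Norm-like stream) is an assumption, since this page does not specify how the streams are fused. For the actual formulation, see the released code.

```python
import math


def layer_norm(x, eps=1e-5):
    """Parameter-free layer normalization over a 1-D list of floats."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def block(x, scale):
    """Hypothetical stand-in for a shared residual block (attention or MLP)."""
    return [scale * math.tanh(v) for v in x]


def siamese_forward(x, scales):
    """Toy two-stream forward pass; one shared update feeds both streams."""
    pre = list(x)   # Pre-Norm-like stream: residual path is never normalized
    post = list(x)  # Post-Norm-like stream: normalized after every update
    for s in scales:
        # ASSUMPTION: fuse by adding the normalized pre stream to the post stream.
        fused = [a + b for a, b in zip(layer_norm(pre), post)]
        update = block(fused, s)  # computed once, shared by both streams
        pre = [p + u for p, u in zip(pre, update)]
        post = layer_norm([p + u for p, u in zip(post, update)])
    return pre, post


if __name__ == "__main__":
    pre, post = siamese_forward([1.0, -2.0, 0.5, 3.0], [0.5] * 6)
    print("pre-like stream: ", pre)
    print("post-like stream:", post)
```

In this sketch the Pre-Norm-like stream accumulates raw updates, so its residual path stays an identity map, while the Post-Norm-like stream is renormalized at every layer and therefore stays bounded. Only the position of the normalization differs between the streams; the block and its update are shared.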

Paper Structure (48 sections, 10 equations, 8 figures, 4 tables)

Introduction
Theoretical Motivation
Preliminaries
Notation
Taxonomy of Normalization Paradigms
Evaluation of Existing Paradigms
Pre-Norm
Post-Norm
Structural Incompatibility
The Dilution Problem (Pre-Norm)
The Distortion Problem (Post-Norm)
SiameseNorm
Architecture and Formulation
Generalization Capabilities
Gradient Analysis
...and 33 more sections

Figures (8)

Figure 1: Architectural comparison of Post-Norm, Pre-Norm and SiameseNorm. In SiameseNorm, the input is duplicated into parallel streams sharing identical residual updates, where distinct LN positioning differentiates the hidden states across layers.
Figure 2: Comparison of Pre-Norm, PreNorm-EmbedNorm, and our SiameseNorm using 1.3B models trained on 100B tokens with a learning rate of $1\times 10^{-3}$.
Figure 3: Practical SiameseNorm design, coupling HybridNorm and Pre-Norm with HybridNorm Attention blocks.
Figure 4: Training loss curves of HybridNorm (yellow), HybridNorm with ResiDual (blue) and our SiameseNorm (green) without Depth-wise Scaling.
Figure 5: Gradient norm comparisons
...and 3 more figures
