SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm
Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, Gao Huang
TL;DR
SiameseNorm tackles the long-standing stability–expressivity trade-off between Pre-Norm and Post-Norm Transformers by introducing a two-stream residual architecture with shared blocks. One stream preserves the identity-gradient dynamics (Pre-Norm-like) while the other enforces bounded representations (Post-Norm-like). The two streams are fused before every computation, so the model obtains stability and depth-enabled expressivity at the same time. Gradient analysis shows how the two streams together preserve robust gradient flow while controlling representation scale. Experiments on 1.3B-parameter models show gains in perplexity and arithmetic reasoning across aggressive learning rates, outperforming the baselines compared against. The approach generalizes existing norms, adds negligible overhead, and offers a foundation for future multi-stream residual designs. Limitations such as task-specific gains and emergent massive activations remain to be explored.
Abstract
Modern Transformers predominantly adopt the Pre-Norm paradigm for its optimization stability, forgoing the superior potential of the unstable Post-Norm architecture. Prior attempts to combine their strengths typically lead to a stability-performance trade-off. We attribute this phenomenon to a structural incompatibility within a single-stream design: any application of the Post-Norm operation inevitably obstructs the clean identity gradient preserved by Pre-Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters. This design decouples the optimization dynamics of the two streams and retains the distinct characteristics of both Pre-Norm and Post-Norm: all residual blocks receive combined gradients inherited from both paradigms, with one stream securing stability while the other enhances expressivity. Extensive pre-training experiments on 1.3B-parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
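To make the two-stream idea concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It relies on the standard single-stream updates, Pre-Norm as x + F(LN(x)) and Post-Norm as LN(x + F(x)), and runs two streams through the same blocks. The fusion rule (summing the normalized Pre-Norm-like stream and the Post-Norm-like stream), the toy `tanh` block, and every name below are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def make_block(dim, rng, scale=0.5):
    """Hypothetical stand-in for an attention/MLP block: tanh(W x)."""
    w = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    def block(x):
        return [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in w]

    return block


def two_stream_forward(x, blocks):
    """Run a Pre-Norm-like and a Post-Norm-like stream through shared blocks.

    ASSUMPTION: the block input is LN(pre) + post. The actual fusion rule
    used by SiameseNorm may differ.
    """
    pre, post = list(x), layer_norm(x)
    for block in blocks:
        fused = [a + b for a, b in zip(layer_norm(pre), post)]
        out = block(fused)  # one shared computation per layer
        # Pre-Norm-like stream: plain residual add, so the skip path is an identity.
        pre = [p + o for p, o in zip(pre, out)]
        # Post-Norm-like stream: normalize after the add, so its scale stays bounded.
        post = layer_norm([p + o for p, o in zip(post, out)])
    return pre, post


def rms(v):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(t * t for t in v) / len(v))


rng = random.Random(0)
dim, depth = 8, 16
blocks = [make_block(dim, rng) for _ in range(depth)]
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
pre, post = two_stream_forward(x, blocks)

print(f"pre-stream RMS:  {rms(pre):.3f}")
print(f"post-stream RMS: {rms(post):.3f}")
```

In this sketch the Post-Norm-like stream has an RMS of at most 1 by construction, while the Pre-Norm-like stream is unconstrained and keeps an unnormalized skip path from input to output. Because each block is evaluated once per layer on the fused input, the second stream adds only element-wise additions and one extra normalization per layer.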
