Holonorm
Daryl Noupa Yongueng, Hamidou Tembine
TL;DR
Holonorm introduces a normalization operator for Transformer models that replaces Tanh-based schemes with a norm-based, geometry-preserving transform. Defined as $HN_p(x) = \dfrac{x}{1 + \|x\|_p}$ (often with $p=1$ or $p=2$), Holonorm maps inputs into the open unit ball, is invertible, 1-Lipschitz, and preserves vector direction and approximate orthogonality, addressing saturation and distortion issues inherent to Tanh. The paper presents theoretical properties, including invertibility and a bound on output norm, and demonstrates through experiments on musical and orthogonal-vector datasets that Holonorm maintains orthogonality and yields better stability and lower error metrics than Tanh in normalization contexts. The results suggest Holonorm’s potential to improve stability and fidelity in deep Transformer architectures, particularly for high-dimensional token representations and long-context reasoning. These findings point to practical benefits for attention mechanisms and sequence modeling where preserved geometry and stable activations are crucial.
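As a rough illustration of the definition above, the following NumPy sketch implements $HN_p(x) = x / (1 + \|x\|_p)$ along the last axis, together with the inverse obtained by solving $\|y\|_p = \|x\|_p / (1 + \|x\|_p)$ for $\|x\|_p$, which gives $x = y / (1 - \|y\|_p)$. The function names `holonorm` and `holonorm_inverse` are hypothetical and chosen for this example; this is not the authors' implementation.

```python
import numpy as np

def holonorm(x, p=2):
    """Sketch of HN_p(x) = x / (1 + ||x||_p), applied along the last axis."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, ord=p, axis=-1, keepdims=True)
    return x / (1.0 + norm)

def holonorm_inverse(y, p=2):
    """Assumed inverse on the open unit ball: x = y / (1 - ||y||_p)."""
    y = np.asarray(y, dtype=float)
    norm = np.linalg.norm(y, ord=p, axis=-1, keepdims=True)
    return y / (1.0 - norm)

x = np.array([3.0, 4.0])       # ||x||_2 = 5
y = holonorm(x)                # scaled by 1 / (1 + 5)
print(np.linalg.norm(y))       # strictly below 1: inside the open unit ball
print(holonorm_inverse(y))     # recovers x up to floating-point error
```

Because the output is a positive scalar multiple of the input, the direction of $x$ is unchanged; only its length is compressed into $[0, 1)$.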
Abstract
Normalization is a key component of Transformer training. The Dynamic Tanh (DyT) work demonstrated that Tanh can be used as an alternative to layer normalization (LN) and confirmed the effectiveness of the idea. However, Tanh itself suffers from orthogonality, linearity, and distortion problems, so that proposal cannot be considered reliable. We therefore propose Holonorm (hn), which has residual connections and nonlinearity and is suitable for replacing Tanh in the context of normalization. Although the Holonorm expression resembles the softsign function in dimension one, softsign is a componentwise function, which makes it poorly suited to tensors and high-dimensional vectors. Holonorm preserves the orthogonality, direction, and invertibility of the signal. Holonorm also serves as a suitable metric and maps all vectors into the open unit ball, which prevents exploding activations and improves stability in deep Transformer models. In this work, we closely examine normalization in Transformers and make two points. First, Holonorm, a generalized form of the softsign function, is well suited as a normalization function. Second, because it takes values between 0 and 1, hn can be read as a percentage, with $1 - \text{Holonorm}$ as its complement, which makes it easier to interpret when evaluating a model.
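The contrast with componentwise functions can be checked numerically. The sketch below is an illustration, not taken from the paper: it applies softsign, Tanh, and a norm-based Holonorm with $p = 2$ to two orthogonal vectors and compares the inner products of the outputs. The vectors `u` and `v` and the helper names are assumptions made for this example.

```python
import numpy as np

def holonorm(x, p=2):
    """Sketch of HN_p(x) = x / (1 + ||x||_p) for a single vector."""
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.linalg.norm(x, ord=p))

def softsign(x):
    """Componentwise softsign: x_i / (1 + |x_i|)."""
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.abs(x))

# Two orthogonal vectors (illustrative values): u . v = 4 - 4 = 0.
u = np.array([1.0, 2.0])
v = np.array([4.0, -2.0])

# Holonorm rescales each vector by a positive scalar, so the inner
# product stays at zero up to rounding error.
print(np.dot(holonorm(u), holonorm(v)))

# Componentwise maps squash each coordinate separately, which in
# general breaks orthogonality.
print(np.dot(softsign(u), softsign(v)))
print(np.dot(np.tanh(u), np.tanh(v)))

# In dimension one the two expressions coincide.
print(holonorm(np.array([2.0])), softsign(np.array([2.0])))
```

For this pair, the softsign and Tanh outputs are no longer orthogonal, while the Holonorm outputs remain so; other vector pairs may show larger or smaller deviations.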
