Beyond the Birkhoff Polytope: Spectral-Sphere-Constrained Hyper-Connections

Zhaoyi Liu, Haichuan Zhang, Ang Li

Abstract

Hyper-Connections (HC) generalize residual connections into multiple streams, employing residual matrices for cross-stream feature mixing to enrich model expressivity. However, unconstrained mixing disrupts the identity mapping property intrinsic to the residual connection, causing unstable training. To address this, Manifold-Constrained Hyper-Connections (mHC) and its variant restrict these matrices to the Birkhoff polytope (doubly stochastic matrices) via Sinkhorn iterations or permutation-based parameterizations. We reveal three limitations of this polytope constraint: (1) identity degeneration, where learned matrices collapse around the identity and diminish cross-stream interactions, (2) an expressivity bottleneck, as the non-negativity constraint prevents subtractive feature disentanglement, and (3) parameterization inefficiencies, manifesting as unstable Sinkhorn iterations or the factorial-scaling overhead of permutation-based parameterizations. To overcome these flaws, we propose Spectral-Sphere-Constrained Hyper-Connections (sHC). By geometrically shifting the feasible set from a rigid polytope to a spectral norm sphere, sHC allows negative entries, unlocking subtractive interactions for selective feature diversification. This shift eliminates unstable Sinkhorn projections and factorial parameterization, enabling expressive, non-degenerate residual matrices while preserving training stability.
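
As a rough illustration of the polytope constraint described in the abstract, the sketch below runs a plain Sinkhorn-Knopp normalization in pure Python. It is not the mHC implementation: the function name `sinkhorn`, the iteration count, and the random logits are assumptions chosen for illustration. It shows that the result is approximately doubly stochastic and that every entry stays strictly positive, which is why subtractive mixing is unavailable under this constraint.

```python
import math
import random


def sinkhorn(logits, n_iters=50):
    """Alternately normalize rows and columns of exp(logits).

    Illustrative only: a textbook Sinkhorn-Knopp loop, not the mHC code.
    """
    n = len(logits)
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(n_iters):
        # Row normalization: every row sums to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column normalization: every column sums to 1.
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]
    return m


random.seed(0)
n = 4  # number of residual streams (assumed)
logits = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]

H = sinkhorn(logits)
row_sums = [sum(row) for row in H]
col_sums = [sum(H[i][j] for i in range(n)) for j in range(n)]
min_entry = min(min(row) for row in H)

print("row sums:", row_sums)
print("col sums:", col_sums)
print("smallest entry:", min_entry)  # always > 0: exp() is positive
```

Because the entries come from `exp` followed by positive rescaling, no choice of logits can produce a negative weight; the constraint is exact only in the limit of many iterations.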

Paper Structure (21 sections, 1 theorem, 34 equations, 11 figures, 4 tables)

Key Result

Proposition 1

Let $J = \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$. For any displacement matrix $\mathcal{H}_l^{\mathrm{disp}} \in \mathcal{Z}_n$, the spectral norm of the corresponding residual matrix $\mathcal{H}_l^{\mathrm{res}} = J + \mathcal{H}_l^{\mathrm{disp}}$ satisfies:
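
The displayed relation that completes Proposition 1 is not reproduced here. Independently of its exact form, a short argument gives the spectral norm in closed form, assuming $\mathcal{Z}_n$ is the set of $n \times n$ matrices whose rows and columns each sum to zero (the "zero-marginal subspace" of Figure 4). In that case $J\,\mathcal{H}_l^{\mathrm{disp}} = \mathcal{H}_l^{\mathrm{disp}} J = 0$, so $(\mathcal{H}_l^{\mathrm{res}})^\top \mathcal{H}_l^{\mathrm{res}} = J + (\mathcal{H}_l^{\mathrm{disp}})^\top \mathcal{H}_l^{\mathrm{disp}}$ and hence $\|\mathcal{H}_l^{\mathrm{res}}\|_2 = \max(1, \|\mathcal{H}_l^{\mathrm{disp}}\|_2)$. The pure-Python sketch below checks this numerically for $n = 4$. The basis vectors `h1`, `h2`, `h3`, the helper names, and the chosen singular values are illustrative assumptions, not the paper's parameterization.

```python
import math


def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, c):
    return [[c * x for x in row] for row in a]


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def spectral_norm(a, n_iters=200):
    """Largest singular value of a, via power iteration on a^T a."""
    at = transpose(a)
    v = [1.0 / (2 ** i) for i in range(len(a[0]))]  # generic start vector
    for _ in range(n_iters):
        w = matvec(at, matvec(a, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = matvec(a, v)
    return math.sqrt(sum(x * x for x in av))


n = 4
J = [[1.0 / n] * n for _ in range(n)]

# Orthonormal vectors that are all orthogonal to the all-ones vector
# (assumed basis, chosen by hand for n = 4).
h1 = [0.5, -0.5, 0.5, -0.5]
h2 = [0.5, 0.5, -0.5, -0.5]
h3 = [0.5, -0.5, -0.5, 0.5]


def displacement(s):
    """D = sum_k s[k] * u_k v_k^T, an explicit SVD inside Z_n."""
    U, V = [h1, h2, h3], [h2, h3, h1]
    D = [[0.0] * n for _ in range(n)]
    for sk, u, v in zip(s, U, V):
        D = add(D, scale(outer(u, v), sk))
    return D


def residual(s):
    return add(J, displacement(s))


H_in = residual([0.9, 0.5, 0.2])   # ||D||_2 = 0.9 <= 1
H_out = residual([1.5, 0.5, 0.2])  # ||D||_2 = 1.5 > 1

print(spectral_norm(H_in), spectral_norm(H_out))
print("min entry of H_in:", min(min(r) for r in H_in))
```

With singular values at most 1 the residual matrix keeps unit spectral norm while its rows and columns still sum to 1, and some entries are negative; once a singular value exceeds 1, the norm of the residual matrix follows it.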

Figures (11)

  • Figure 1: sHC overcomes the identity degeneration, expressivity bottleneck, and parameterization inefficiencies of existing manifold-constrained hyper-connections (mHC [xie2025mhc]) and their permutation-based variant (mHC-lite [yang2026mhc]). Left: Learned residual matrices (4 streams). Residual matrices of mHC and mHC-lite degenerate into the identity mapping, whereas sHC leverages diverse signed entries for subtractive mixing. Middle: Language modeling performance. Perplexity is presented relative to the standard residual connection baseline. sHC yields observable perplexity reductions across all five corpora. Right: Parameterization overhead. As the number of streams increases, sHC eliminates the factorial explosion of auxiliary parameters inherent to mHC-lite.
  • Figure 2: Dynamics of $\mathcal{H}_l^{\mathrm{res}}$ for mHC and mHC-lite during training. Left: the row-wise maximum entries of $\mathcal{H}_l^{\mathrm{res}}$. Solid lines show the median, and shaded regions show the 10th to 90th percentiles. Right: the proportion of $\mathcal{H}_l^{\mathrm{res}}$ matrices whose row maxima all lie on the diagonal. Statistics are computed across all layers of the model at each training step.
  • Figure 3: Mean pairwise cosine similarity among residual streams after being mixed by $\mathcal{H}_l^{\mathrm{res}}$. Left: the identity-mapping baseline, where $\mathcal{H}_l^{\mathrm{res}}$ is fixed to the identity matrix and all other settings are identical to mHC. Right: mHC. Each colored line tracks one layer depth.
  • Figure 4: Overview of Spectral-Sphere-Constrained Hyper-Connections (sHC). The right orange plane depicts the zero-marginal subspace $\mathcal{Z}_n$, where the blue disk centered at the origin $O$ represents the bounded spectral region $\|\mathcal{H}_l^{\mathrm{disp}}\|_2 \le 1$. The SVD parameterization generates the displacement matrix $\mathcal{H}_l^{\mathrm{disp}}$ (red point) within this region. The left blue plane illustrates the target affine space $\mathcal{A}_n$, containing the Birkhoff polytope $\mathcal{B}_n$ (inner orange polygon), which is enclosed by the affine-constrained spectral norm sphere $\mathcal{S}_n$ (black circle centered at the uniform matrix $J$). The affine shift $+J$ maps the origin $O$ to $J$ and the displacement $\mathcal{H}_l^{\mathrm{disp}}$ to the final residual matrix $\mathcal{H}_l^{\mathrm{res}}$.
  • Figure 5: Gradient norm dynamics during training for the L model on OpenWebText. The unconstrained HC exhibits exploding gradients (light orange), clamped at 5.0 for visualization. Other residual connection paradigms show stable gradient trajectories.
  • ...and 6 more figures
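
The two diagnostics named in the captions of Figures 2 and 3 (whether every row maximum of a residual matrix sits on the diagonal, and the mean pairwise cosine similarity of the streams after mixing) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' evaluation code: the helper names, the toy stream vectors in `X`, and the convention that each row of `X` is one stream are all assumptions.

```python
import math
from itertools import combinations


def mix(H, X):
    """Mix n streams (rows of X) with an n x n residual matrix H."""
    n, d = len(X), len(X[0])
    return [[sum(H[i][k] * X[k][j] for k in range(n)) for j in range(d)]
            for i in range(n)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def mean_pairwise_cosine(X):
    """Average cosine similarity over all unordered pairs of streams."""
    pairs = list(combinations(range(len(X)), 2))
    return sum(cosine(X[i], X[j]) for i, j in pairs) / len(pairs)


def row_max_on_diagonal(H):
    """True if the largest entry of every row lies on the diagonal."""
    return all(max(range(len(row)), key=lambda j: row[j]) == i
               for i, row in enumerate(H))


n = 4
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
uniform = [[1.0 / n] * n for _ in range(n)]
# Doubly stochastic matrix close to the identity: 0.8 * I + 0.2 * J.
near_identity = [[0.8 * identity[i][j] + 0.2 * uniform[i][j]
                  for j in range(n)] for i in range(n)]

# Toy streams: 4 streams with 3 features each (made-up values).
X = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]

for name, H in [("identity", identity),
                ("near-identity", near_identity),
                ("uniform", uniform)]:
    sim = mean_pairwise_cosine(mix(H, X))
    print(name, row_max_on_diagonal(H), round(sim, 4))
```

In this toy setting the identity leaves the stream similarity unchanged, the near-identity matrix raises it slightly, and the uniform matrix makes all streams identical, so any non-negative averaging can only pull streams together.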

Theorems & Definitions (5)

  • Proposition 1: Spectral Decoupling
  • proof (4 untitled entries)