
go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices

Torque Dandachi, Sophia Diggs-Galligan

Abstract

Doubly stochastic matrices enable learned mixing across residual streams, but parameterizing the set of doubly stochastic matrices (the Birkhoff polytope) exactly and efficiently remains an open challenge. Existing exact methods scale factorially with the number of streams ($d$), while Kronecker-factorized approaches are efficient but expressivity-limited. We introduce a novel exact parameterization grounded in the theory of generalized orthostochastic matrices, which scales as $\mathcal{O}(d^3)$ and exposes a single hyperparameter $s$ that continuously interpolates between a computationally efficient boundary and the fully expressive Birkhoff polytope. Building on Manifold-Constrained Hyper-Connections ($m$HC), a framework for learned dynamic layer connectivity, we instantiate this parameterization in go-$m$HC. Our method composes naturally with Kronecker-factorized methods, substantially recovering expressivity at similar FLOP costs. Spectral analysis indicates that go-$m$HC fills the Birkhoff polytope far more completely than Kronecker-factorized baselines. On synthetic stream-mixing tasks, go-$m$HC achieves the minimum theoretical loss while converging up to $10\times$ faster. We validate our approach in a 30M-parameter GPT-style language model. The expressivity, efficiency, and exactness of go-$m$HC offer a practical avenue for scaling $d$ as a new dimension of model capacity.
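
As a toy illustration (ours, not the paper's code) of the stream mixing described above: with $d$ residual streams stacked as rows of $\mathbf{x}_l$, a doubly stochastic matrix mixes the streams in the residual update $\mathbf{x}_{l+1} = \mathcal{H}_l^{\text{res}}\mathbf{x}_l + \mathcal{F}(\mathbf{x}_l)$ (the integration shown in Figure 1). A minimal numpy sketch, using a fixed uniform mixing matrix in place of a learned one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 4, 8                          # d residual streams, each of width c
x = rng.normal(size=(d, c))          # stacked residual streams x_l

# Uniform doubly stochastic mixing matrix (one point of the Birkhoff
# polytope B_d); go-mHC would learn this matrix rather than fixing it.
H = np.full((d, d), 1.0 / d)
assert np.allclose(H.sum(axis=0), 1.0)
assert np.allclose(H.sum(axis=1), 1.0)

def layer(x):
    """Stand-in for the layer F(x_l); any stream-wise map would do."""
    return np.tanh(x)

x_next = H @ x + layer(x)            # x_{l+1} = H^res x_l + F(x_l)
print(x_next.shape)                  # (4, 8)
```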

Paper Structure

This paper contains 57 sections, 3 theorems, 33 equations, 30 figures, 8 tables, and 1 algorithm.

Key Result

Proposition 2.1

The spectral space of a $k$-fold Kronecker product $\bigotimes_{i=1}^k \mathbf{A}_i$, where each $\mathbf{A}_i \in \mathsf{B}_q$, is identical to the spectral space of $\mathsf{B}_q$. $\blacktriangleleft$
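
To make the key result concrete for $q=2$: a matrix in $\mathsf{B}_2$ has the form $\begin{pmatrix} a & 1-a \\ 1-a & a\end{pmatrix}$ with eigenvalues $1$ and $2a-1$, both in $[-1,1]$, and the eigenvalues of a Kronecker product are products of the factors' eigenvalues, so they remain in $[-1,1]$. The following numpy check (our illustration, not the paper's code) verifies this numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_b2(rng):
    """Random element of B_2: [[a, 1-a], [1-a, a]] for a in [0, 1]."""
    a = rng.uniform()
    return np.array([[a, 1.0 - a], [1.0 - a, a]])

# Each B_2 factor has eigenvalues {1, 2a - 1}, i.e. real values in [-1, 1].
# Per Proposition 2.1, a k-fold Kronecker product of B_2 matrices has the
# same spectral region, so its eigenvalues must also be real and in [-1, 1].
for _ in range(1000):
    k = int(rng.integers(2, 5))        # number of Kronecker factors
    M = random_b2(rng)
    for _ in range(k - 1):
        M = np.kron(M, random_b2(rng))
    eig = np.linalg.eigvals(M)
    assert np.allclose(eig.imag, 0.0)
    assert eig.real.min() >= -1.0 - 1e-9 and eig.real.max() <= 1.0 + 1e-9

print("All sampled Kronecker products have real spectra inside [-1, 1].")
```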

Figures (30)

  • Figure 1: Spectral analysis and architectural integration of go-mHC. (Left) Spectral Comparison: Illustration of spectral reach and approximation gaps across several manifold parameterization methods. mHC (and the Sinkhorn-Knopp algorithm) is not guaranteed to remain within the manifold after finitely many iterations (red points). go-mHC (Ours) closely approximates the Birkhoff polytope $\mathsf{B}_d$ within an $\epsilon(s)$ error margin using only $\frac{d(d-1)}{2}$ parameters. mHC-lite provides full coverage but requires $d!$ parameters, while KromHC is restricted to tensors of $\mathsf{B}_2$, disallowing non-adjacent cycles of length $>2$ and severely restricting the space of learnable matrices. (Middle) Parameterization Pipeline: Our method maps learned skew-symmetric parameters $\mathbf{X} = -\mathbf{X}^\top \in \mathbb{R}^{ds \times ds}$ to an orthogonal matrix $\mathbf{Q} \in \mathcal{Q}(ds)$ via the Cayley Transform $(\mathbf{I}-\mathbf{X})(\mathbf{I}+\mathbf{X})^{-1}$. The final doubly stochastic matrix $\mathbf{B} \in \mathbb{R}^{d \times d}$ is obtained by a block-wise Frobenius norm projection $\frac{1}{s} \|\mathbf{Q}_{ij}\|_F^2$ (a runnable sketch of this pipeline follows the figure list). (Right) Hyper-Connections: The transformation $\mathcal{H}^{\text{res}}$ is integrated into the residual stream as $\mathbf{x}_{l+1} = \mathcal{H}_l^{\text{res}}\mathbf{x}_l + \mathcal{F}(\mathbf{x}_l)$, where the doubly stochastic constraint ensures training stability and improved gradient flow.
  • Figure 2: Illustration of the Karpelevič region -- the spectrum of stochastic matrices -- for $\mathsf{B}_d$ with $d=2$ to $d=5$. The black circle corresponds to the unit disc and the red points are the eigenvalues $\omega$ of the $d\times d$ permutation matrices contained in the polytope. The affine space defined by doubly stochastic matrices is strictly contained within the volume of the blue region, except in $\mathsf{B}_2$, where the space is a line. The spectrum of $\mathsf{B}_d$ is the union of the spectrum of $\mathsf{B}_{d-1}$ and the polygon with $d$ sides inscribed in the unit disc.
  • Figure 3: Histogram of the spectral reach of a toy model implementing the map $\mathcal{P}_{\text{go}}$ to $\mathsf{B}_3$ ($d=3$) for different values of $s \in \{1, 2\}$ on random targets. (a) When $s=1$, the $d=3$ model is able to learn maps with full spectral coverage within the hypocycloid -- covering the triangular region (Karpelevič region of $\mathsf{B}_3 \setminus \mathsf{B}_2$) up to error $\epsilon(s=1)$, as well as all real eigenvalues on the interval $[-1, 1]$ (Karpelevič region of $\mathsf{B}_2$). (b) $s=2$ covers a larger region, with a smaller gap in the spectrum: $\epsilon(s=2)<\epsilon(s=1)$.
  • Figure 4: Histogram of the spectral reach of a toy model implementing the maps $\mathcal{P}_{\text{SK}}$, $\mathcal{P}_{\text{lite}}$, $\mathcal{P}_{\text{krom}}$, and $\mathcal{P}_{\text{go}}$ on random targets in $\mathsf{B}_4$. The blue and green lines in (a-e) are the boundaries of the spectra of the $\mathsf{B}_3 \setminus \mathsf{B}_2$ and $\mathsf{B}_4 \setminus \mathsf{B}_3$ polytopes respectively, as illustrated in Figure 2. (a) mHC (implemented via $\mathcal{P}_{\text{SK}}$) contains points outside of $\mathsf{B}_4$ (this effect is exaggerated for illustration via initial conditions with large variance). (b) mHC-lite shows good expressivity, filling the region near the boundary almost completely without extending past it. (c) KromHC fails to represent the complex region inside of $\mathsf{B}_4$ -- since the "shadow" is $1$-dimensional, it fills an infinitesimal slice of the full polytope. (d) For $s=1$, go-mHC represents a sizeable portion of the region, although it cannot represent all matrices near the boundary of the polytope. (e) With $s=2$, go-mHC begins to represent larger regions of $\mathsf{B}_4$, filling the region more densely.
  • Figure 5: Comparing the total number of learnable parameters in mHC with its exact variants. go-mHC (green) has the same scaling as mHC regardless of $s$, with $s=1$ requiring fewer parameters due to skew-symmetry. mHC-lite (orange) scales as $\mathcal{O}(d!)$ despite its exactness and completeness over $\mathsf{B}_d$, blowing up beyond $d=8$ and becoming intractable. KromHC (red) scales most favorably, at $\mathcal{O}(d\log d)$, growing slower than mHC by trading off expressivity and imposing a strong symmetry on residual-stream interactions. KromHC relies on the factorization of $d$, so we analyze only the best-case parameter count, when $d$ is a power of 2.
  • ...and 25 more figures
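
The parameterization pipeline in the Figure 1 caption (skew-symmetric $\mathbf{X}$, Cayley transform to an orthogonal $\mathbf{Q}$, then block-wise squared Frobenius norms divided by $s$) can be traced end to end in a few lines of numpy. Function and variable names here are our own; this is a minimal illustration of the map, not the paper's implementation:

```python
import numpy as np

def go_mhc_map(X, d, s):
    """Map skew-symmetric X in R^{ds x ds} to a d x d doubly stochastic
    matrix via the pipeline in the Figure 1 caption (names are ours)."""
    n = d * s
    assert X.shape == (n, n) and np.allclose(X, -X.T)
    I = np.eye(n)
    # Cayley transform: Q = (I - X)(I + X)^{-1} is orthogonal whenever X
    # is skew-symmetric, and I + X is then always invertible.
    Q = (I - X) @ np.linalg.inv(I + X)
    # Block-wise squared Frobenius norms: B_ij = ||Q_ij||_F^2 / s, with
    # Q_ij the (i, j)-th s x s block of Q.
    blocks = Q.reshape(d, s, d, s)
    return (blocks ** 2).sum(axis=(1, 3)) / s

rng = np.random.default_rng(0)
d, s = 4, 2
A = rng.normal(size=(d * s, d * s))
B = go_mhc_map(A - A.T, d, s)        # A - A^T is skew-symmetric
print(B.sum(axis=0))                 # columns sum to 1
print(B.sum(axis=1))                 # rows sum to 1
print((B >= 0).all())                # entries are nonnegative
```

Because each block-row (and block-column) of $\mathbf{Q}$ collects $s$ orthonormal rows (columns) of an orthogonal matrix, the squared Frobenius norms along that block-row sum to $s$; dividing by $s$ therefore yields rows and columns summing to 1 exactly, with no iterative Sinkhorn-style projection.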

Theorems & Definitions (6)

  • Proposition 2.1
  • Proof
  • Lemma 2.2
  • Proof
  • Proposition 2.3
  • Proof