JPmHC Dynamical Isometry via Orthogonal Hyper-Connections

Biswa Sengupta, Jinhua Wang, Leo Brunswic

TL;DR

JPmHC introduces a spectrum-aware generalization of Hyper-Connections by replacing identity skips with a trainable multi-stream mixer under manifold constraints. It provides a dual spectral analysis framework (scalar and operator-valued free probability) to predict Jacobian spectra, and develops efficient backward passes for both Sinkhorn projections and Cayley-based orthogonal projections. Evaluated on ARC-AGI with a 7M-parameter TRM, the Cayley JPmHC variant converges faster, reaches higher exact-match accuracy, and uses less compute than bistochastic baselines, which the authors attribute to preserved dynamical isometry and improved gradient conditioning. The work demonstrates that constraining mixing operators to geometric manifolds (orthogonal via Cayley, Grassmannian subspaces, etc.) can yield significant gains in stability, scalability, and efficiency for deep, multi-stream architectures with recursive computation. Overall, JPmHC offers a principled route toward spectrum-aware, topologically informed architectural design in scalable neural networks, with practical benefits for tasks requiring robust long-range gradient propagation.

Abstract

Recent advances in deep learning, exemplified by Hyper-Connections (HC), have expanded the residual connection paradigm by introducing wider residual streams and diverse connectivity patterns. While these innovations yield significant performance gains, they compromise the identity mapping property of residual connections, leading to training instability, limited scalability, and increased memory overhead. To address these challenges, we propose JPmHC (Jacobian-spectrum Preserving manifold-constrained Hyper-Connections), a framework that replaces identity skips with a trainable linear mixer acting on n parallel streams while explicitly controlling gradient conditioning. By constraining the mixer M to operator-norm-bounded manifolds (e.g., bistochastic, Stiefel, Grassmann), JPmHC prevents gradient pathologies and enhances stability. JPmHC introduces three key contributions: (i) a free-probability analysis that predicts Jacobian spectra for structured skips, providing actionable design rules for mixer selection; (ii) memory-efficient implicit differentiation for fixed-point projections, reducing activation memory and synchronization overhead; and (iii) a Stiefel-constrained mixer via Cayley transforms, ensuring orthogonality without post-hoc normalization. Empirical evaluations on ARC-AGI demonstrate that JPmHC achieves faster convergence, higher accuracy, and lower computational cost compared to bistochastic baselines. As a flexible and scalable extension of HC, JPmHC advances spectrum-aware, stable, and efficient deep learning, offering insights into topological architecture design and foundational model evolution.
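To make the two mixer constraints named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a plain alternating row/column normalization for the bistochastic (Sinkhorn) projection and the convention $M = (I - A)^{-1}(I + A)$ with skew-symmetric $A$ for the Cayley transform; the function names `sinkhorn` and `cayley`, the iteration count, and the stream count `n = 4` are illustrative choices, and the paper may use a different parameterization.

```python
import numpy as np

def sinkhorn(logits, n_iters=200):
    """Approximate bistochastic projection (assumed form): alternately
    normalize the rows and columns of exp(logits)."""
    M = np.exp(logits - logits.max())  # strictly positive entries
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # columns sum to 1
    return M

def cayley(params):
    """Cayley transform (assumed convention): map an unconstrained square
    matrix to an orthogonal one via its skew-symmetric part."""
    A = params - params.T                 # skew-symmetric: A^T = -A
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - A, I + A)  # (I - A)^{-1} (I + A)

rng = np.random.default_rng(0)
n = 4  # number of streams (illustrative)
theta = rng.normal(size=(n, n))

M_bis = sinkhorn(theta)
M_orth = cayley(theta)

sv_bis = np.linalg.svd(M_bis, compute_uv=False)
sv_orth = np.linalg.svd(M_orth, compute_uv=False)
print("bistochastic singular values:", np.round(sv_bis, 4))
print("orthogonal singular values:  ", np.round(sv_orth, 4))
```

Both mixers have operator norm at most 1, but only the orthogonal one has every singular value equal to 1. A generic bistochastic mixer contracts some directions, which is the mechanism behind the difference in gradient conditioning that the abstract describes.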

Paper Structure (119 sections, 5 theorems, 67 equations, 5 figures, 15 tables, 9 algorithms)

This paper contains 119 sections, 5 theorems, 67 equations, 5 figures, 15 tables, 9 algorithms.

Key Result

Proposition 2.1

Under the Kronecker structure $Y^l = (A_n^l \otimes I_p) + D^l W^l$ and mean-field isotropy, the $\mathcal{B}$-valued order parameter $M(z) \in M_n(\mathbb{C})$ defined for $z\in M_n(\mathbb{C})$ satisfies the matrix fixed-point equation where $A_h(z) := A_n + \sigma^2 z$ is the dressed matrix. The scalar Cauchy transform is recovered by $G(z) = \frac{1}{n}\mathrm{Tr}_n\bigl(zI_n - A_h(z)^\top A_

Figures (5)

  • Figure 1: Scalar and operator-valued (OV) theories vs. Monte Carlo singular value densities. Panels show four skip-connection types ($n=4$ streams, $p=25$ per stream, $c_2 L = 0.05$, $\eta = 0.02$, $500$ samples) at depths $L \in \{1, 2, 10\}$ (rows) for mixers $A_n \in \{\text{Identity},\, \text{Bistochastic},\, \text{Orthogonal}, \}$ (columns). The scalar Dyson prediction (dotted orange curve) matches the Monte Carlo histograms at $L=1$ for all cases. At $L \in \{2, 10\}$, bistochastic and Gaussian mixers develop mass near zero (spectral collapse), while orthogonal mixers preserve dynamical isometry. At these depths the scalar theory fails, whereas the operator-valued theory captures the full spectrum. The regularization parameter $\eta$, used to reduce numerical instabilities, smooths the distribution: it pushes mass away from zero and damps the spikes, thereby increasing the mass allocated to the mode at 1.0. The scaling $L c_2 = \text{const}$ ensures that weights $W^l \sim \mathcal{N}(0, \sigma_w^2/L)$ maintain constant forward signal variance. This normalization is shown to be accurate, as the spectra have a main mode bounded away from 0 and infinity.
  • Figure 2: Hyper-Connected Transformer Encoder Block. Block 1 (Multi-Head Attention, bottom) feeds into Block 2 (Feed-Forward, top). Within each block the input forks into a twisted skip path (thick arrows, $H_{\mathrm{res}}$) and a compute path (thin arrows) that projects $q{\to}1$ streams, applies the layer, expands $1{\to}q$, and gates via $H_{\mathrm{out}}$; both paths merge at the $+$ node before Layer Norm.
  • Figure 3: Two representative ARC-AGI tasks. Each task requires discovering a latent rule (here, pattern tiling and region filling) from a few demonstrations, then applying it to a novel test input.
  • Figure 4: Pass@$k$ scaling comparison. Cayley consistently outperforms Sinkhorn across all sampling budgets $k$. The gap narrows at higher $k$, indicating Sinkhorn has higher prediction variance.
  • Figure 5: Evaluation accuracy curves. (a) Per-token accuracy shows both variants exceeding 85%, but (b) exact-match accuracy reveals a persistent gap: Cayley reaches 31.4% vs. Sinkhorn's 22.2%.
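The Monte Carlo side of Figure 1 can be approximated with a short simulation. The sketch below is a simplified stand-in for the paper's experiment and rests on several assumptions: the layer Jacobian is taken as $Y^l = (A_n \otimes I_p) + W^l$ with $D^l = I$ (a linear network), the weights are i.i.d. Gaussian with variance $c_2 / N$ where $N = np$ and $c_2 = 0.05 / L$, the bistochastic mixer is the fixed matrix $\tfrac12 I + \tfrac12 \cdot \tfrac1n \mathbf{1}\mathbf{1}^\top$, and a single sample is drawn rather than 500. The helper names are hypothetical.

```python
import numpy as np

def jacobian_singular_values(A, p, depth, c2_total, rng):
    """Singular values of a product of `depth` layer Jacobians
    Y = kron(A, I_p) + W, with W Gaussian (assumed linear network, D = I)."""
    n = A.shape[0]
    N = n * p
    skip = np.kron(A, np.eye(p))   # structured skip A_n (x) I_p
    c2 = c2_total / depth          # keeps L * c2 constant across depths
    J = np.eye(N)
    for _ in range(depth):
        W = rng.normal(scale=np.sqrt(c2 / N), size=(N, N))
        J = (skip + W) @ J
    return np.linalg.svd(J, compute_uv=False)

def make_mixers(n, rng):
    """Three illustrative n x n mixers (the specific choices are assumptions)."""
    identity = np.eye(n)
    # Convex combination of two doubly stochastic matrices.
    bistochastic = 0.5 * np.eye(n) + 0.5 * np.full((n, n), 1.0 / n)
    # Q factor of a Gaussian matrix is orthogonal.
    orthogonal, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return {"identity": identity,
            "bistochastic": bistochastic,
            "orthogonal": orthogonal}

rng = np.random.default_rng(0)
mixers = make_mixers(4, rng)
summary = {}
for name, A in mixers.items():
    sv = jacobian_singular_values(A, p=25, depth=10, c2_total=0.05, rng=rng)
    summary[name] = {"min": sv.min(), "median": np.median(sv), "max": sv.max()}
    print(f"{name:>12}: min={sv.min():.3e}  "
          f"median={np.median(sv):.3e}  max={sv.max():.3e}")
```

Under these assumptions, the bistochastic mixer contracts every direction orthogonal to the all-ones stream vector by a factor of 0.5 per layer, so at depth 10 most singular values sit close to zero, while the identity and orthogonal mixers keep the spectrum concentrated around 1. Plotting histograms of `sv` over many samples would give a rough analogue of the figure's panels.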

Theorems & Definitions (9)

  • Proposition 2.1: Kronecker collapse
  • Definition A.1: Doubly-Stochastic Matrix
  • Definition A.2: Stiefel Manifold
  • Definition A.3: Grassmann Manifold
  • Proposition H.1: Implicit Sinkhorn Gradient (eisenberger2022unified)
  • Proposition H.2: Gauss-Seidel Convergence Rate
  • Remark H.3
  • Proposition I.1: Norm Preservation
  • Proposition I.2: Determinant
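
The statements of Propositions I.1 and I.2 are not reproduced in this listing. Assuming they refer to the standard properties of the Cayley-transformed mixer, the short derivation below shows why such a mixer preserves norms and has unit determinant. The convention $Q = (I - A)^{-1}(I + A)$ with skew-symmetric $A$ is an assumption; the paper may use an equivalent variant.

```latex
\begin{align*}
&\text{Let } A^\top = -A \text{ and } Q = (I - A)^{-1}(I + A).
  \text{ Then } (I - A)^\top = I + A, \text{ so} \\
&Q^\top = (I + A)^\top (I - A)^{-\top} = (I - A)(I + A)^{-1}. \\[4pt]
&\text{All factors are functions of } A \text{ and therefore commute:} \\
&Q^\top Q = (I - A)(I + A)^{-1}(I - A)^{-1}(I + A) = I. \\[4pt]
&\text{Norm preservation: }
  \lVert Qx \rVert_2^2 = x^\top Q^\top Q\, x = \lVert x \rVert_2^2
  \quad \text{for all } x. \\[4pt]
&\text{Determinant: }
  \det Q = \frac{\det(I + A)}{\det(I - A)}
         = \frac{\det(I + A)}{\det\bigl((I - A)^\top\bigr)}
         = \frac{\det(I + A)}{\det(I + A)} = 1.
\end{align*}
```

The inverse always exists because a real skew-symmetric $A$ has purely imaginary eigenvalues, so $I - A$ has no zero eigenvalue. The unit determinant also means this parameterization reaches only rotations, not reflections.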