JPmHC: Dynamical Isometry via Orthogonal Hyper-Connections
Biswa Sengupta, Jinhua Wang, Leo Brunswic
TL;DR
JPmHC is a spectrum-aware generalization of Hyper-Connections that replaces identity skips with a trainable multi-stream mixer constrained to a manifold. It provides a dual spectral analysis framework (scalar and operator-valued free probability) for predicting Jacobian spectra, and develops efficient backward passes for both Sinkhorn projections and Cayley-based orthogonal projections. On ARC-AGI with a 7M-parameter TRM, the Cayley variant of JPmHC converges faster, reaches higher exact-match accuracy, and uses less compute than bistochastic baselines, which the authors attribute to preserved dynamical isometry and improved gradient conditioning. The results indicate that constraining mixing operators to geometric manifolds (orthogonal via Cayley, Grassmannian subspaces, etc.) can improve stability, scalability, and efficiency in deep, multi-stream architectures with recursive computation. Overall, JPmHC offers a principled route toward spectrum-aware, topologically informed architecture design for scalable neural networks, with practical benefits for tasks that require robust long-range gradient propagation.
Abstract
Recent advances in deep learning, exemplified by Hyper-Connections (HC), have expanded the residual connection paradigm by introducing wider residual streams and diverse connectivity patterns. While these innovations yield significant performance gains, they compromise the identity mapping property of residual connections, leading to training instability, limited scalability, and increased memory overhead. To address these challenges, we propose JPmHC (Jacobian-spectrum Preserving manifold-constrained Hyper-Connections), a framework that replaces identity skips with a trainable linear mixer acting on n parallel streams while explicitly controlling gradient conditioning. By constraining the mixer M to operator-norm-bounded manifolds (e.g., bistochastic, Stiefel, Grassmann), JPmHC prevents gradient pathologies and enhances stability. JPmHC makes three key contributions: (i) a free-probability analysis that predicts Jacobian spectra for structured skips, providing actionable design rules for mixer selection; (ii) memory-efficient implicit differentiation for fixed-point projections, reducing activation memory and synchronization overhead; and (iii) a Stiefel-constrained mixer via Cayley transforms, ensuring orthogonality without post-hoc normalization. Empirical evaluations on ARC-AGI demonstrate that JPmHC achieves faster convergence, higher accuracy, and lower computational cost than bistochastic baselines. As a flexible and scalable extension of HC, JPmHC advances spectrum-aware, stable, and efficient deep learning, offering insights into topological architecture design and foundational model evolution.
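As a rough illustration of contribution (iii) and of the contrast with bistochastic mixers, the NumPy sketch below builds an orthogonal mixer with the Cayley transform and a bistochastic one with Sinkhorn normalization, then compares the singular values of each mixer composed over many layers. It is not the authors' implementation: the function names `cayley` and `sinkhorn`, the stream count n = 4, the depth of 32, and the use of a single shared mixer with no residual branch are all assumptions made for illustration.

```python
import numpy as np


def cayley(a):
    """Map an unconstrained square matrix to an orthogonal one (Cayley transform)."""
    s = a - a.T                       # skew-symmetric part: s.T == -s
    eye = np.eye(a.shape[0])
    # Q = (I - S)^{-1} (I + S); I - S is invertible for any real skew-symmetric S.
    return np.linalg.solve(eye - s, eye + s)


def sinkhorn(logits, n_iters=200):
    """Approximately project exp(logits) onto bistochastic matrices
    by alternating row and column normalization."""
    p = np.exp(logits - logits.max())  # strictly positive entries
    for _ in range(n_iters):
        p = p / p.sum(axis=1, keepdims=True)   # rows sum to 1
        p = p / p.sum(axis=0, keepdims=True)   # columns sum to 1
    return p


rng = np.random.default_rng(0)
n, depth = 4, 32                      # illustrative values, not from the paper
a = rng.normal(size=(n, n))           # shared unconstrained parameter

q = cayley(a)                         # orthogonal mixer
p = sinkhorn(a)                       # bistochastic mixer

# Singular values of the depth-fold composition of each mixer.
sv_q = np.linalg.svd(np.linalg.matrix_power(q, depth), compute_uv=False)
sv_p = np.linalg.svd(np.linalg.matrix_power(p, depth), compute_uv=False)

print("Cayley (orthogonal) singular values:", np.round(sv_q, 6))
print("Sinkhorn (bistochastic) singular values:", np.round(sv_p, 6))
```

Under these assumptions, the Cayley mixer keeps every singular value at 1 regardless of depth, whereas the strictly positive bistochastic mixer preserves only the all-ones direction and contracts the remaining directions. This is one way to picture the gradient-conditioning argument made in the abstract, although the full Jacobian in the paper also involves the residual branch, which this sketch omits.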
