Hyper-Connections

Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, Xun Zhou

TL;DR

Hyper-Connections (HC) replace fixed residual strengths with learnable depth-connections and width-connections, enabling dynamic layer rearrangement in transformers. The framework includes static HC (SHC) and dynamic HC (DHC), initialized to mimic Pre-Norm residuals and extended with input-dependent parameters, respectively. Across dense and MoE language models up to 7B and in vision tasks, DHC yields faster convergence and improved accuracy (e.g., improved ARC-Challenge performance and reduced losses), with large-scale 7B results showing stable training and fewer spikes. Visualization uncovers a Lambda-shaped pattern of cross-layer interactions, supporting HC’s ability to balance long-range and local connectivity while incurring negligible parameter and compute overhead. Overall, HC provides a broadly applicable, effective alternative to residual connections in deep networks.

Abstract

We present hyper-connections, a simple yet effective method that can serve as an alternative to residual connections. This approach specifically addresses common drawbacks observed in residual connection variants, such as the seesaw effect between gradient vanishing and representation collapse. Theoretically, hyper-connections allow the network to adjust the strength of connections between features at different depths and dynamically rearrange layers. We conduct experiments focusing on the pre-training of large language models, including dense and sparse models, where hyper-connections show significant performance improvements over residual connections. Additional experiments conducted on vision tasks also demonstrate similar improvements. We anticipate that this method will be broadly applicable and beneficial across a wide range of AI problems.

Paper Structure

This paper contains 39 sections, 43 equations, 18 figures, 19 tables, and 3 algorithms.

Figures (18)

  • Figure 1: The performance of the baseline model OLMoE-1B-7B and the model with hyper-connections, OLMoE-1B-7B-DHC$\times$4. (1) and (2) show the training loss (0.99 EMA smoothed) and the C4-en validation loss, respectively. Our method converges 1.8 times faster than the baseline and maintains a significant advantage at 500B tokens. (3) and (4) show the accuracy curves on HellaSwag and ARC-Challenge, demonstrating the superior performance of the OLMoE-1B-7B-DHC$\times$4 model.
  • Figure 2: Hyper-connections (HC) with an expansion rate of $n=2$. (a) Residual connections. (b) Hyper-connections: $\beta_1$, $\beta_2$, $\alpha_{0,0}$, $\alpha_{0,1}$, $\alpha_{1,0}$, $\alpha_{1,1}$, $\alpha_{2,1}$, and $\alpha_{2,2}$ are learnable scalars or scalars predicted by the network, depending on the specific HC version. These connections enable lateral information exchange and vertical integration of features across depths. The Transformer with HC is shown in Fig. \ref{fig:trans_with_hc}. Hyper-connections can be decoupled into depth-connections and width-connections. (c) Depth-connections perform a weighted sum between the layer output and the hidden vector $h_1$. (d) Width-connections allow information exchange between the hidden vectors $h_1$ and $h_2$.
  • Figure 3: Cosine similarity between the inputs of the current and previous layers for the OLMo-1B models (Groeneveld et al., 2024). The curve represents the median similarity, while the shaded area indicates the range between the 5th and 95th percentiles. The red curve shows the model with Pre-Norm, and the blue curve shows the model with hyper-connections.
  • Figure 4: Sequential and parallel arrangements of hyper-connections with $n=2$.
  • Figure 5: Comparison of training loss curves for different expansion rates. The left subfigure includes models with dynamic hyper-connections (DHC) at various expansion rates, while the right subfigure shows the effect of omitting the tanh function. Both subfigures illustrate how increasing the expansion rate leads to improved training loss over $500$B tokens. Results are smoothed using an exponential moving average with a coefficient of 0.99.
  • ...and 13 more figures
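
The depth- and width-connections described in the Figure 2 caption, together with the TL;DR's remark that static HC is initialized to mimic Pre-Norm residuals, can be illustrated with a small sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the names `read`, `write`, and `mix`, the toy layer, and the particular initialization (the layer reads one stream, writes to every stream with weight 1, and the streams are mixed by an identity matrix). Under those assumptions every hidden stream should track an ordinary residual stream exactly.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def toy_layer(x: Vector) -> Vector:
    """Hypothetical stand-in for a transformer block (its normalization is folded in)."""
    return [math.tanh(v) for v in x]


def static_hc_init(n: int, layer_index: int) -> Tuple[Vector, Vector, Matrix]:
    """Assumed initialization for one layer with n hidden streams."""
    # Weights that select which stream feeds the layer (one-hot, rotating with depth).
    read = [1.0 if i == layer_index % n else 0.0 for i in range(n)]
    # Weights on the layer output added back to each stream (the beta scalars).
    write = [1.0] * n
    # Stream-to-stream weights (the alpha scalars); identity means no exchange yet.
    mix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return read, write, mix


def hc_step(streams: Matrix, layer: Callable[[Vector], Vector],
            read: Vector, write: Vector, mix: Matrix) -> Matrix:
    """Apply one layer wrapped in hyper-connections to n hidden streams."""
    n, d = len(streams), len(streams[0])
    # Layer input: weighted sum of the hidden streams.
    layer_in = [sum(read[i] * streams[i][k] for i in range(n)) for k in range(d)]
    out = layer(layer_in)
    # New stream j: weighted layer output plus a weighted sum of the old streams.
    return [
        [write[j] * out[k] + sum(mix[i][j] * streams[i][k] for i in range(n))
         for k in range(d)]
        for j in range(n)
    ]


def run_hc(x: Vector, num_layers: int, n: int) -> Matrix:
    """Replicate the input into n streams and apply num_layers HC-wrapped layers."""
    streams = [list(x) for _ in range(n)]
    for k in range(num_layers):
        read, write, mix = static_hc_init(n, k)
        streams = hc_step(streams, toy_layer, read, write, mix)
    return streams


def run_residual(x: Vector, num_layers: int) -> Vector:
    """Plain residual stream h <- h + layer(h), used as the reference."""
    h = list(x)
    for _ in range(num_layers):
        h = [a + b for a, b in zip(h, toy_layer(h))]
    return h
```

In this sketch the static variant would treat `read`, `write`, and `mix` as trainable scalars, while the dynamic variant would predict them from the input; neither training step is shown here.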

Theorems & Definitions (3)

  • proof
  • proof
  • proof