DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

Matteo Pagliardini, Amirkeivan Mohtashami, Francois Fleuret, Martin Jaggi

TL;DR

DenseFormer tackles the inefficiency of extreme Transformer depth by introducing a Depth Weighted Average (DWA) that aggregates the current and all past block outputs after every transformer block, enabling strong inter-block information flow with only a small parameter overhead. The method uses learnable weights, an initialization under which each DWA acts as the identity, and optional dilation and periodicity to control compute, yielding data-efficient improvements in perplexity and faster inference than deeper baselines. Empirically, DenseFormer beats same-depth Transformers, matches deeper models with fewer parameters and a lower memory footprint, and shows robust gains on OpenWebText2 and PG-19, including with longer sequences. Analyses of the learned DWA weights reveal stable, interpretable patterns that emphasize early representations and structured information reuse, supporting the proposed mechanism for improved information flow.

Abstract

The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding. We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size -- adding a few thousand parameters for large-scale models in the 100B parameters range. Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past representations -- we refer to this operation as Depth-Weighted-Average (DWA). The learned DWA weights exhibit coherent patterns of information flow, revealing the strong and structured reuse of activations from distant layers. Experiments demonstrate that DenseFormer is more data efficient, reaching the same perplexity of much deeper transformer models, and that for the same perplexity, these new models outperform transformer baselines in terms of memory efficiency and inference time.
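The averaging step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the names `identity_dwa_weights` and `denseformer_forward` are hypothetical, blocks are stand-in functions on plain lists, and it assumes that the history stores raw block outputs $X_j$, that the DWA output feeds the next block, and that identity initialization puts all weight on the current output.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def identity_dwa_weights(num_blocks: int) -> List[List[float]]:
    """Weights for depths 1..num_blocks; depth i has i+1 entries (for X_0..X_i).

    Assumed identity initialization: all weight on the current output X_i,
    so the model initially behaves like a plain stack of blocks.
    """
    return [[0.0] * i + [1.0] for i in range(1, num_blocks + 1)]


def denseformer_forward(
    x0: Vector, blocks: List[Block], alphas: List[List[float]]
) -> Vector:
    """Run blocks in sequence, applying a depth-weighted average after each.

    Assumption: `history` keeps the raw block outputs X_0..X_i, and the
    averaged result is what the next block receives as input.
    """
    history = [x0]  # X_0: the embedded input
    current = x0
    for block, alpha in zip(blocks, alphas):
        history.append(block(current))  # X_i
        # Weighted sum over depth, computed coordinate by coordinate.
        current = [
            sum(a * x[d] for a, x in zip(alpha, history))
            for d in range(len(x0))
        ]
    return current
```

With the identity weights the function reduces to applying the blocks one after another. Under this layout a model with $n$ blocks carries $n(n+3)/2$ DWA weights, for example 1224 for 48 blocks, which is consistent with the "few thousand parameters" overhead quoted above.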

Paper Structure (42 sections, 4 equations, 15 figures, 7 tables)


Figures (15)

  • Figure 1: DenseFormer architecture. The diagram in (a) shows the DenseFormer architecture with two transformer layers and a dilation of $1$. After the first (resp. second) block, the past and current intermediary representations $\{X_0, X_1\}$ (resp. $\{X_0, X_1, X_2\}$) are averaged using the first (resp. second) DWA weights $[\alpha_{0,0},\alpha_{0,1}]$ (resp. $[\alpha_{1,0},\alpha_{1,1},\alpha_{1,2}]$). The DWA weights are supported by red arrows. Those weights are represented in matrix form in (b), for a $12$-layer DenseFormer. A DWA module at depth $i$ has $i+1$ weights, represented in red. Increasing the dilation sparsifies this matrix, reducing the computational overhead without degrading the perplexity; see Section \ref{sec:dilated} for more details.
  • Figure 2: DWA weights with dilation and DWA period. For a $12$-layer DenseFormer, the $\alpha$ weights are sparsified using dilation $k$ and DWA periodicity $p$ (referred to as $k\textsf{x}p$). Compared to Fig. \ref{fig:denseformer_dilation}, only certain rows have some weights other than the upper diagonal weights (which correspond to the regular transformer information flow). Increasing the dilation and period sparsifies the $\alpha$ matrix, reducing the computational overhead without degrading the perplexity; see Sections \ref{sec:dilated} and \ref{sec:period} for more details.
  • Figure 3: Speed and performance trade-off. Comparison of the speed and performance trade-off between the standard Transformer architecture and DenseFormer. The number of blocks in each architecture is reported next to the data point. All DenseFormer models on this plot use a dilation factor of $4$. We show results using a DWA period of $1$ and $5$. Comparing perplexities: Considering only the perplexity (y-axis), a $48$ block DenseFormer performs similarly to a much deeper $72$ block Transformer. Comparing trade-offs: A $48$ block $4\textsf{x}5$-DenseFormer matches the better perplexity of a $72$ block Transformer while being $1.4\times$ faster at inference.
  • Figure 4: Training and inference efficiency of $k\textsf{x}p$-DenseFormer vs. Transformer. For $48$ block models, we compare in (a) the different perplexity/inference speed trade-offs reached by a regular Transformer and $k\textsf{x}p$-DenseFormers. In the top right corner, the Transformer baseline is the model with the worst perplexity but the fastest inference. In contrast, the $1\textsf{x}1$-DenseFormer in the bottom left corner reaches the best perplexity but incurs a cost in inference speed. By varying the dilation $k$ and DWA period $p$, some $k\textsf{x}p$-DenseFormer models (e.g. $4\textsf{x}5$) provide most of the perplexity improvement of the original DenseFormer while significantly reducing the time overhead. A similar analysis holds when looking at the training speed in (b). In (c), we show the perplexity decreasing during training. The x-axis is time. To compensate for the computational overhead of DenseFormer, we train the Transformer baseline for more iterations, such that the two methods have the same training time budget. We observe that our $4\textsf{x}5$-DenseFormer reaches a better perplexity faster than the baseline. The perplexity in this figure is computed on a small subset of the validation set to avoid slowing down the training.
  • Figure 5: Visualization of DWA Learned Weights. Each row shows the weights $\alpha$ learned by a DWA module at a given depth. While the heatmaps are averaged across 3 runs with different seeds, those patterns are very consistent across seeds. In (a) and (b), strikingly similar patterns can be observed in both $48$ and $72$ layer DenseFormers. In (c), we show the learned weights for a $48$ block DenseFormer trained with a dilation of $4$. Despite the sparsity, we still observe a very similar pattern to those learned by the non-dilated models.
  • ...and 10 more figures
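
The $k\textsf{x}p$ sparsity pattern from the Figure 1 and Figure 2 captions can be sketched as a boolean mask over the $\alpha$ matrix. The exact rules are not spelled out in the captions, so this sketch assumes two things: with dilation $k$, row $i$ keeps only entries $j$ with $(i - j) \bmod k = 0$, and with period $p$, only rows with $i \bmod p = 0$ carry a DWA module, while every other row keeps just its diagonal entry (the regular transformer flow). The function names `dwa_mask` and `count_nonzero_entries` are hypothetical.

```python
from typing import List


def dwa_mask(num_blocks: int, dilation: int = 1, period: int = 1) -> List[List[bool]]:
    """Boolean mask of the alpha matrix; row for depth i covers X_0..X_i.

    Assumptions (not confirmed by the captions):
      * dilation k keeps alpha[i][j] only when (i - j) % k == 0;
      * period p places a DWA module only at depths with i % p == 0;
      * the diagonal entry (j == i) is always kept, since it carries
        the regular transformer information flow.
    """
    mask = []
    for i in range(1, num_blocks + 1):
        row_has_dwa = i % period == 0
        row = []
        for j in range(i + 1):
            on_dilation_grid = (i - j) % dilation == 0
            row.append(j == i or (row_has_dwa and on_dilation_grid))
        mask.append(row)
    return mask


def count_nonzero_entries(mask: List[List[bool]]) -> int:
    """Number of kept entries, a rough proxy for the averaging cost."""
    return sum(sum(row) for row in mask)
```

Under these assumptions, a $12$ block model goes from 90 kept entries for $1\textsf{x}1$ to 15 for $4\textsf{x}5$, which illustrates how quickly the matrix thins out as $k$ and $p$ grow.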