Revisiting the Shape Convention of Transformer Language Models

Feng-Ting Liao; Meng-Hsi Chen; Guan-Ting Yi; Da-shan Shiu

TL;DR

The paper challenges the long-standing narrow-wide-narrow FFN convention in dense Transformer language models and proposes an hourglass FFN with $K$ sub-blocks and bottleneck size $d_h < d_{\text{model}}$. By integrating this hourglass FFN into a standard Transformer backbone, the authors demonstrate that hourglass configurations can outperform conventional FFNs at smaller scales (up to around $4\times 10^{8}$ parameters) and remain competitive at larger scales near $1\times 10^{9}$ parameters, often rebalancing parameters toward the attention module. Key findings include a robust width-depth trade-off with an optimal $d_{\text{model}}/L$ around 100–250 and a typical $d_h/d_{\text{model}}$ around 0.4–0.6, with deeper hourglass structures (larger $K$) yielding consistent gains. These results suggest that deeper bottlenecked FFNs combined with increased attention capacity can yield more efficient and expressive language models, potentially reshaping architectural choices for scalable transformers.

Abstract

Dense Transformer language models have largely adhered to one consistent architectural shape: each layer consists of an attention module followed by a feed-forward network (FFN) with a narrow-wide-narrow MLP, allocating most parameters to the MLP at expansion ratios between 2 and 4. Motivated by recent results showing that residual wide-narrow-wide (hourglass) MLPs offer superior function approximation capabilities, we revisit the long-standing MLP shape convention in Transformers and challenge the necessity of the narrow-wide-narrow design. To study this, we develop a Transformer variant that replaces the conventional FFN with a deeper hourglass-shaped FFN, comprising a stack of hourglass sub-MLPs connected by residual pathways. We posit that a deeper but lighter hourglass FFN can serve as a competitive alternative to the conventional FFN, and that the parameters saved by a lighter hourglass FFN can be used more effectively elsewhere, for example by enlarging the model hidden dimension under a fixed budget. We confirm both hypotheses empirically across model scales: hourglass FFNs outperform conventional FFNs up to 400M parameters and achieve comparable performance at larger scales up to 1B parameters, and hourglass FFN variants with fewer FFN parameters and more attention parameters show consistent improvements over conventional configurations at matched budgets. Together, these findings shed new light on recent work and prompt a rethinking of the narrow-wide-narrow MLP convention and of the balance between attention and FFN in efficient and expressive modern language models.
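To make the "deeper but lighter" claim concrete, the sketch below counts FFN weights per layer for both shapes. It is an illustration under stated assumptions, not the paper's accounting: each MLP is taken to be two bias-free weight matrices with no gating (gated variants such as SwiGLU use three), and the values `d_model = 1024`, `d_h = 512`, and `k = 4` are hypothetical choices within the ranges quoted in the TL;DR.

```python
def conventional_ffn_params(d_model: int, expansion: float = 4.0) -> int:
    """Weights in one narrow-wide-narrow MLP (assumed: two matrices, no bias, no gating)."""
    d_ff = int(expansion * d_model)
    return 2 * d_model * d_ff


def hourglass_ffn_params(d_model: int, d_h: int, k: int) -> int:
    """Weights in k wide-narrow-wide sub-MLPs (assumed: two matrices each, no bias)."""
    return k * 2 * d_model * d_h


# Hypothetical configuration: d_h / d_model = 0.5 and K = 4.
d_model = 1024
conv = conventional_ffn_params(d_model, expansion=4.0)
hour = hourglass_ffn_params(d_model, d_h=512, k=4)

print(conv)         # 8388608
print(hour)         # 4194304
print(hour / conv)  # 0.5
```

Under these assumptions the hourglass FFN is four sub-blocks deep yet holds half the weights of a 4x-expansion FFN; the remainder is the budget the abstract proposes to spend on a larger hidden dimension or on attention.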

Paper Structure (35 sections, 5 equations, 4 figures, 8 tables)

Introduction
Background and Related Works
Narrow-wide-narrow MLP in Transformer FFN
Revisiting Shape Through Hourglass MLPs
Transformer with Hourglass FFN
Network Architecture
Hourglass Transformer Layer
Hourglass Feed-Forward Network
Hourglass FFN Transformer Shape
Parameter Redistribution from FFN to Attention.
Trading FFN Width for Depth.
Experiments
Experimental Setup
Baselines.
Transformer with Hourglass FFN.
...and 20 more sections

Figures (4)

Figure 1: Performance frontiers of Transformers with hourglass (wide-narrow-wide) versus conventional (narrow-wide-narrow) FFNs [touvron2023llama]. We revisit the shape convention of the Transformer by replacing the narrow-wide-narrow FFN with an hourglass FFN composed of stacks of wide-narrow-wide sub-MLPs connected by residuals. We observe that Hourglass FFNs achieve comparable performance to the conventional design up to 1B parameters. We also show a conventional variant based on the OLMo-2 architecture. Only non-embedding parameters are counted toward the FLOPs.
Figure 2: Overview: revisiting the shape convention of the Transformer by relaxing the MLP shape in the FFN. Inspired by [pmlr-v235-liu24am, chen2025rethinkingshapeconventionmlp], we compare Transformer architectural variants with conventional FFN and hourglass FFN. (a) Conventional Transformer Block with $L'$ layers, consisting of an attention module and a conventional FFN with a narrow-wide-narrow MLP. (b) Hourglass Transformer Block with $L$ layers, consisting of an attention module followed by an hourglass FFN with $K$ hourglass-shaped MLP sub-blocks. We explore the design space by tuning parameters such as $d_{\text{model}}$, $d_h$, $K$, and $L$, allowing the Hourglass layer count $L$ to differ from the baseline $L'$.
Figure 3: Validation loss across different $d_h/d_{\text{model}}$ ratios for Hourglass FFNs with varying depth $K$ at $L=12$. The lowest validation loss is observed at $K=4$ and $d_h/d_{\text{model}} \approx 0.4$. The total model size is fixed at 113M parameters.
Figure 4: Validation loss versus $d_{\text{model}}/L$ ratio for different Hourglass FFN configurations at 113M parameters. The validation loss is minimized when $d_{\text{model}}/L$ is around 110 for $K=4$, around 180 for $K=2$, and around 144 for $K=1$.
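
The hourglass FFN described in Figure 2(b), a stack of $K$ wide-narrow-wide sub-MLPs joined by residual connections, can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the GELU activation, the absence of normalization and biases, the initialization scale, and the names `init_hourglass_ffn` and `hourglass_ffn` are all assumptions, since this page does not specify those details.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU (the activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def init_hourglass_ffn(d_model, d_h, k, seed=0):
    """Create k (down, up) weight pairs; d_h < d_model gives the bottleneck."""
    rng = np.random.default_rng(seed)
    return [
        (
            rng.normal(0.0, d_model ** -0.5, size=(d_model, d_h)),  # wide -> narrow
            rng.normal(0.0, d_h ** -0.5, size=(d_h, d_model)),      # narrow -> wide
        )
        for _ in range(k)
    ]


def hourglass_ffn(x, params):
    """Apply each sub-MLP in turn, adding its output back to the residual stream."""
    for w_down, w_up in params:
        x = x + gelu(x @ w_down) @ w_up
    return x


# Hypothetical sizes: d_h / d_model = 0.5, K = 4.
params = init_hourglass_ffn(d_model=64, d_h=32, k=4)
x = np.random.default_rng(1).normal(size=(2, 5, 64))  # (batch, sequence, d_model)
y = hourglass_ffn(x, params)
print(y.shape)  # (2, 5, 64)
```

Because every sub-block maps $d_{\text{model}}$ back to $d_{\text{model}}$, $K$ and $d_h$ can be varied independently of the rest of the layer, which is what allows the sweeps over $K$ and $d_h/d_{\text{model}}$ in Figures 3 and 4.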
