One Wide Feedforward is All You Need

Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan

TL;DR

This work reveals substantial redundancy in the FFN components of Transformer models and demonstrates that sharing or dropping FFNs can yield large parameter and latency savings with minimal accuracy loss. By introducing the One Wide FFN architecture—sharing a single enlarged FFN across encoder layers and removing the decoder FFN—the authors achieve improved accuracy and faster inference compared to a fully parameterized Transformer Big. They validate these findings across MT tasks, decoder-only variants, low-resource directions, and multilingual setups, and supplement them with representational-similarity analyses (CKA and LNS) and qualitative redundancy assessments. The results suggest FFN sharing as a practical path to efficient, scalable Transformers without sacrificing much performance, with encoder-wide widening proving particularly effective.

Abstract

The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work we explore the role of the FFN, and find that despite taking up a significant fraction of the model's parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN on the decoder layers and sharing a single FFN across the encoder. Finally we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.
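
The parameter accounting behind this argument can be sketched in a few lines. The dimensions below (d_model = 1024, d_ff = 4096, 6 encoder and 6 decoder layers) are the usual Transformer Big settings and are assumed here, not taken from the text above. The widened hidden dimension is simply chosen to refill the original FFN budget and may differ from the value the authors use.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of one position-wise FFN: two linear maps plus their biases."""
    return d_model * d_ff + d_ff + d_ff * d_model + d_model


# Assumed Transformer Big dimensions (hypothetical, for illustration only).
D_MODEL, D_FF, N_ENC, N_DEC = 1024, 4096, 6, 6

# Baseline: every encoder and decoder layer owns its own FFN.
baseline = (N_ENC + N_DEC) * ffn_params(D_MODEL, D_FF)

# One FFN shared across the encoder, no FFN in the decoder.
shared_enc_no_dec = ffn_params(D_MODEL, D_FF)

# Widen the single shared FFN so its hidden size matches the summed
# hidden sizes of the twelve FFNs it replaces (an illustrative choice).
wide_d_ff = (N_ENC + N_DEC) * D_FF
one_wide = ffn_params(D_MODEL, wide_d_ff)

print(f"baseline FFN params:        {baseline:,}")
print(f"shared encoder, no decoder: {shared_enc_no_dec:,}")
print(f"one wide FFN (d_ff={wide_d_ff}): {one_wide:,}")
```

Under these assumptions the shared configuration keeps one twelfth of the FFN parameters, and the widened FFN lands just under the baseline budget because it carries a single output bias instead of twelve.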

Paper Structure (34 sections, 3 equations, 2 figures, 10 tables)

Figures (2)

  • Figure 1: CKA self-similarity of encoder and decoder layers of the One Wide Encoder model vs. the Transformer Big baseline. Each component is identified by a label of the form index.name. For example, 0.sa refers to the self-attention on layer $0$, while 4.ca refers to the cross-attention on layer $4$.
  • Figure 2: Layerwise LNS between SharedEncSharedDec and Transformer Big (blue bars). The LNS between two versions of Transformer Big trained from different random initializations is shown by the grey bars to ground the comparison. FFN sharing does not dramatically change the activations produced at each layer.
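
For readers unfamiliar with CKA, the following is a minimal sketch of the linear variant, computed on two activation matrices with one row per example. The captions above do not say which CKA variant or which inputs the authors use, so the linear kernel, the function name `linear_cka`, and the random test data are all assumptions made for illustration.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (n_examples, n_features)."""
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


rng = np.random.default_rng(0)
acts_a = rng.normal(size=(256, 32))      # stand-in for one layer's activations
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
acts_b = acts_a @ rotation               # same representation, rotated
acts_c = rng.normal(size=(256, 32))      # unrelated representation

cka_same = linear_cka(acts_a, acts_b)
cka_unrelated = linear_cka(acts_a, acts_c)
print(f"rotated copy: {cka_same:.3f}, unrelated: {cka_unrelated:.3f}")
```

Linear CKA is invariant to orthogonal transformations of the feature space, so the rotated copy scores essentially 1 while the unrelated matrix scores much lower. That invariance is what makes it suitable for comparing layers whose individual neurons are not aligned.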