PartialFormer: Modeling Part Instead of Whole for Machine Translation

Tong Zheng; Bei Li; Huiwen Bao; Jiale Wang; Weiqiao Shan; Tong Xiao; Jingbo Zhu

TL;DR

PartialFormer tackles the Transformer FFN bottleneck by introducing Partial-Level Gated FFNs (PG-FFNs), an ensemble of small, parameter-shared FFNs that preserve or expand the effective hidden dimension while drastically reducing parameters and computation. The PG-FFNs are integrated into both self- and cross-attention blocks, with a residual-like attention calculation and a two-stage hybrid scaling strategy (depth and head scaling) to boost capacity efficiently. Empirical results across nine translation tasks and one abstractive summarization task show that PartialFormer achieves higher or comparable BLEU and ROUGE scores with fewer parameters and MACs than vanilla Transformers and several baselines, aided by improved head diversity and FFN efficiency. The approach remains effective when combined with existing architectures, and ablations substantiate the importance of PG-FFN design choices, gating, and the residual-like attention mechanism for stable optimization and performance gains.
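The idea of small, parameter-shared FFNs working on per-head slices can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: it omits the gating, the attention computation, layer normalization and residual connections, and the function name `shared_head_ffn` and all sizes are hypothetical.

```python
import numpy as np

def shared_head_ffn(x, w1, b1, w2, b2, num_heads):
    """Apply one small ReLU FFN, shared across heads, to each head slice of x.

    x: (seq_len, d_model); w1: (d_head, d_hidden); w2: (d_hidden, d_head).
    Assumption: the real PG-FFN also uses gating, which is omitted here.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Split the model dimension into per-head slices: (seq_len, num_heads, d_head)
    heads = x.reshape(seq_len, num_heads, d_head)
    # The same weights are broadcast over the head axis
    hidden = np.maximum(heads @ w1 + b1, 0.0)   # (seq_len, num_heads, d_hidden)
    out = hidden @ w2 + b2                      # (seq_len, num_heads, d_head)
    # Concatenate the heads back into the model dimension
    return out.reshape(seq_len, d_model)

# Hypothetical sizes chosen only for illustration
rng = np.random.default_rng(0)
seq_len, d_model, num_heads, d_hidden = 5, 512, 8, 512
d_head = d_model // num_heads
w1 = rng.normal(scale=0.02, size=(d_head, d_hidden))
b1 = np.zeros(d_hidden)
w2 = rng.normal(scale=0.02, size=(d_hidden, d_head))
b2 = np.zeros(d_head)
x = rng.normal(size=(seq_len, d_model))

y = shared_head_ffn(x, w1, b1, w2, b2, num_heads)
print(y.shape)
```

In this sketch each head passes through its own `d_hidden`-wide activation, so the hidden units summed over heads number `num_heads * d_hidden`, while only one set of small weight matrices is stored.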

Abstract

The design choices in Transformer feed-forward neural networks (FFNs) have resulted in significant computational and parameter overhead. In this work, we emphasize the importance of hidden dimensions in designing lightweight FFNs, a factor often overlooked in previous architectures. Guided by this principle, we introduce PartialFormer, a parameter-efficient Transformer architecture that uses multiple smaller FFNs to reduce parameters and computation while maintaining essential hidden dimensions. These smaller FFNs are integrated into a multi-head attention mechanism for effective collaboration. We also propose a tailored head scaling strategy to enhance PartialFormer's capabilities. Furthermore, we present a residual-like attention calculation to improve depth scaling within PartialFormer. Extensive experiments on 9 translation tasks and 1 abstractive summarization task validate the effectiveness of PartialFormer. Our code will be available at: https://github.com/zhengkid/PartialFormer.
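The parameter argument can be made concrete with a small arithmetic sketch. The sizes below (model dimension 512, FFN hidden size 2048, 8 heads of width 64, per-head hidden size 512) are assumptions chosen for illustration and are not taken from the paper's configurations; the helper `ffn_params` is hypothetical.

```python
def ffn_params(d_in, d_hidden):
    """Parameters of a two-layer FFN d_in -> d_hidden -> d_in, with biases."""
    return d_in * d_hidden + d_hidden + d_hidden * d_in + d_in

# Assumed baseline: one FFN over the full model dimension
vanilla = ffn_params(512, 2048)

# Assumed alternative: one small FFN over a 64-wide head slice,
# either shared by all 8 heads or instantiated once per head
shared_small = ffn_params(64, 512)
unshared_small = 8 * shared_small

# Hidden units summed over the 8 heads
total_hidden = 8 * 512

print(vanilla)         # 2099712
print(shared_small)    # 66112
print(unshared_small)  # 528896
print(total_hidden)    # 4096
```

Under these assumed sizes, the per-head FFNs expose 4096 hidden units in total, twice the baseline's 2048, with roughly a quarter of the parameters when unshared and far fewer when a single small FFN is shared.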

Paper Structure (60 sections, 5 equations, 5 figures, 21 tables)

Introduction
Preliminary: Transformer
Multi-Head Self-Attention
Feed-Forward Network
PartialFormer
Overall Architecture
Encoder.
Decoder.
Partial-Level Gated FFN
Intuition
Design of PG-FFNs
Residual-like Attention Calculation
Efficient Scaling Strategy
Head Scaling
Experimental Setups
...and 45 more sections

Figures (5)

Figure 1: Illustration of our idea.
Figure 2: (a) Architecture of Transformer. (b) Architecture of PartialFormer. (c) Details of the Self-AFFN block. All architectures are based on the pre-normalization strategy. We omit the layer normalization operation, residual connection, softmax operation, and scale coefficient for simplicity.
Figure 3: (a) Scaling Up PartialFormer with Different Methods. (b) Scaling Transformer and PartialFormer with Head Scaling.
Figure 4: Analysis on behaviours of FFNs and head diversity in Transformer and PartialFormer.
Figure 5: Comparison of token uniformity (lower is better) in Transformer and PartialFormer.