PartialFormer: Modeling Part Instead of Whole for Machine Translation
Tong Zheng, Bei Li, Huiwen Bao, Jiale Wang, Weiqiao Shan, Tong Xiao, Jingbo Zhu
TL;DR
PartialFormer tackles the Transformer FFN bottleneck by introducing Partial-Level Gated FFNs (PG-FFNs), an ensemble of small, parameter-shared FFNs that preserve or expand the effective hidden dimension while drastically reducing parameters and computation. The PG-FFNs are integrated into both self- and cross-attention blocks, with a residual-like attention calculation and a two-stage hybrid scaling strategy (depth and head scaling) to boost capacity efficiently. Empirical results across nine translation tasks and one abstractive summarization task show that PartialFormer achieves higher or comparable BLEU and ROUGE scores with fewer parameters and MACs than vanilla Transformers and several baselines, aided by improved head diversity and FFN efficiency. The approach remains effective when combined with existing architectures, and ablations substantiate the importance of PG-FFN design choices, gating, and the residual-like attention mechanism for stable optimization and performance gains.
Abstract
The design choices in Transformer feed-forward neural networks (FFNs) have resulted in significant computational and parameter overhead. In this work, we emphasize the importance of hidden dimensions in designing lightweight FFNs, a factor often overlooked in previous architectures. Guided by this principle, we introduce PartialFormer, a parameter-efficient Transformer architecture that uses multiple smaller FFNs to reduce parameters and computation while maintaining essential hidden dimensions. These smaller FFNs are integrated into a multi-head attention mechanism for effective collaboration. We also propose a tailored head scaling strategy to enhance PartialFormer's capabilities. Furthermore, we present a residual-like attention calculation to improve depth scaling within PartialFormer. Extensive experiments on 9 translation tasks and 1 abstractive summarization task validate the effectiveness of PartialFormer. Our code will be available at: https://github.com/zhengkid/PartialFormer.
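To make the parameter argument concrete, the following minimal sketch compares the parameter count of one standard two-layer FFN with that of a single small FFN shared across attention heads. It is an illustration under stated assumptions, not the authors' implementation: the helper name `ffn_params`, the sizes (model width 512, 8 heads, hidden sizes 2048 and 256), and the choice to share one small FFN across all heads are hypothetical, and gating and the residual-like attention calculation are omitted.

```python
def ffn_params(d_in: int, d_hidden: int) -> int:
    """Weights and biases of a two-layer FFN mapping d_in -> d_hidden -> d_in."""
    return d_in * d_hidden + d_hidden + d_hidden * d_in + d_in


# Illustrative sizes only; none of these values are taken from the paper.
d_model, n_heads = 512, 8
d_head = d_model // n_heads  # per-head width: 64

# One standard FFN over the full model width.
vanilla = ffn_params(d_model, 2048)

# One small FFN over a single head's features, reused for every head
# (an assumed form of parameter sharing).
small_hidden = 256
shared_small = ffn_params(d_head, small_hidden)

# Hidden units summed over heads: the total hidden width is unchanged.
total_hidden = n_heads * small_hidden

print(f"vanilla FFN parameters:      {vanilla}")
print(f"shared small FFN parameters: {shared_small}")
print(f"total hidden width:          {total_hidden}")
```

Under these assumptions the total hidden width stays at 2048 while the FFN parameter count falls from 2,099,712 to 33,088, which illustrates how several small FFNs could keep the hidden dimension while cutting parameters.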
