Heterogeneous Encoders Scaling In The Transformer For Neural Machine Translation
Jia Cheng Hu, Roberto Cavicchioli, Giulia Berardinelli, Alessandro Capotondi
TL;DR
The paper investigates whether Transformer-based neural machine translation benefits from multiple heterogeneous encoders. It introduces the Multi-Encoder Transformer, which combines up to five encoder types (LSTM, ConvS2S, Self-Attention, Static Expansion, and FNet) through simple summation fusion, and evaluates it on five translation tasks. Self-Attention remains the strongest single encoder, but dual-encoder configurations consistently improve over the single-encoder model, especially for low-resource languages (gains of up to 7.16 BLEU), whereas adding further encoders yields mixed results at higher computational cost. The results indicate that encoder diversity can improve translation quality, expose a trade-off between quality and compute, and point to more sophisticated fusion mechanisms as a direction for future work.
Abstract
Although the Transformer is currently the best-performing architecture in the homogeneous configuration (self-attention only) in Neural Machine Translation, many State-of-the-Art models in Natural Language Processing combine different Deep Learning approaches. However, these models often combine only a couple of techniques, and it is unclear why some methods are chosen over others. In this work, we investigate the effectiveness of integrating an increasing number of heterogeneous methods. Based on a simple combination strategy and performance-driven synergy criteria, we designed the Multi-Encoder Transformer, which consists of up to five diverse encoders. Results show that our approach can improve translation quality across a variety of languages and dataset sizes. It is particularly effective in low-resource languages, where we observed a maximum increase of 7.16 BLEU compared to the single-encoder model.
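
As a rough illustration of the summation fusion named in the TL;DR, the sketch below runs several encoders over the same input sequence and adds their outputs element-wise. It is a minimal toy in plain Python, not the authors' implementation: the two encoders (scale_encoder and running_mean_encoder) are hypothetical stand-ins for real LSTM, ConvS2S, or Self-Attention encoders, and the function name fuse_by_summation is likewise an assumption made for this example.

```python
from typing import Callable, List

Vector = List[float]
TokenSequence = List[Vector]
Encoder = Callable[[TokenSequence], TokenSequence]


def scale_encoder(factor: float) -> Encoder:
    """Hypothetical stand-in encoder: scales every feature by `factor`."""
    def encode(tokens: TokenSequence) -> TokenSequence:
        return [[factor * x for x in vec] for vec in tokens]
    return encode


def running_mean_encoder(tokens: TokenSequence) -> TokenSequence:
    """Hypothetical stand-in encoder: each position becomes the mean of
    all positions up to and including it (a crude recurrent-style mix)."""
    if not tokens:
        return []
    totals = [0.0] * len(tokens[0])
    encoded = []
    for count, vec in enumerate(tokens, start=1):
        totals = [t + x for t, x in zip(totals, vec)]
        encoded.append([t / count for t in totals])
    return encoded


def fuse_by_summation(encoders: List[Encoder], tokens: TokenSequence) -> TokenSequence:
    """Run every encoder on the same input and sum the outputs element-wise.

    Assumes all encoders preserve sequence length and feature size, which
    summation requires.
    """
    outputs = [encode(tokens) for encode in encoders]
    for out in outputs:
        if len(out) != len(tokens) or any(len(v) != len(t) for v, t in zip(out, tokens)):
            raise ValueError("all encoders must return the input shape")
    return [
        [sum(features) for features in zip(*position_vectors)]
        for position_vectors in zip(*outputs)
    ]


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0]]  # two positions, two features each
    fused = fuse_by_summation([scale_encoder(2.0), running_mean_encoder], tokens)
    print(fused)  # [[3.0, 6.0], [8.0, 11.0]]
```

Because summation needs every encoder output to have the same shape, the sketch checks this explicitly. Adding another encoder does not change the size of the fused representation, only the cost of computing it, which is consistent with the quality versus compute trade-off noted above.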
