Efficient transformer with reinforced position embedding for language models
Yen-Che Hsiao, Abhishek Dutta
TL;DR
The paper targets efficient language modeling by proposing a reinforced positional embedding strategy for an encoder–decoder Transformer. It introduces three modifications: column-wise normalization of the token embeddings, concatenation (rather than addition) of token and positional embeddings before the first blocks, and use of the normalized token embeddings as the attention values. On Portuguese–English translation and 14 further translation datasets from TensorFlow (Ye2018WordEmbeddings), the approach achieves lower training and validation losses, with roughly 2–4x shorter training time and about a threefold reduction in parameters compared with a deeper baseline; Belarusian–English is the one notable exception. The results suggest a practical route to more parameter- and time-efficient Transformers for translation.
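The three modifications listed above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes that "column-wise normalization" means scaling each column of the embedding matrix to unit L2 norm, that the normalized embeddings are the ones concatenated with a sinusoidal positional encoding, and that queries and keys are the concatenated vectors themselves with no learned projections. All function names are hypothetical.

```python
import math

def sinusoidal_encoding(seq_len, dim):
    """Standard sinusoidal positional encoding, shape (seq_len, dim)."""
    return [
        [
            math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
            else math.cos(pos / 10000 ** ((i - 1) / dim))
            for i in range(dim)
        ]
        for pos in range(seq_len)
    ]

def normalize_columns(matrix, eps=1e-12):
    """Scale each column to unit L2 norm (assumed form of the normalization)."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) + eps
             for j in range(n_cols)]
    return [[row[j] / norms[j] for j in range(n_cols)] for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_embedding_values(token_emb, pos_enc):
    """Single-head attention whose values are the normalized token embeddings."""
    values = normalize_columns(token_emb)
    # `t + p` on two lists is concatenation along the feature axis, not addition.
    x = [t + p for t, p in zip(values, pos_enc)]
    scale = math.sqrt(len(x[0]))
    out = []
    for q in x:
        # Simplification: queries and keys are the concatenated inputs themselves.
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in x])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return x, out

# Toy input: 4 tokens with 3-dimensional embeddings (made-up numbers).
token_emb = [[0.5, -1.0, 2.0], [1.5, 0.0, -2.0], [-0.5, 1.0, 1.0], [2.0, 2.0, 0.0]]
pos_enc = sinusoidal_encoding(seq_len=4, dim=2)
x, out = attention_with_embedding_values(token_emb, pos_enc)
print(len(x[0]), len(out[0]))  # concatenated width, attention output width
```

Because the positional encoding is concatenated, the attention scores see position and token content in separate coordinates, while the output stays in the token-embedding space: each output row is a weighted average of the normalized embedding rows.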
Abstract
In this paper, we propose an efficient Transformer architecture that uses reinforced positional embedding to obtain superior performance with half the number of encoder and decoder layers. We demonstrate that three changes, namely concatenating the positional encoding with trainable token embeddings, normalizing the columns of the token embedding matrix, and using the normalized token embedding matrix as the value of the attention layer, improve the training loss, validation loss, and training time of an encoder-decoder Transformer on a Portuguese-English translation task trained for 10 epochs or 12 hours across 10 trials. Our method, with roughly a threefold parameter reduction relative to the baseline, yields a mean training loss of 1.21, a mean validation loss of 1.51, and an average training time of 1352.27 seconds per epoch. The baseline, which has the same embedding dimension but adds the positional encoding to the token embeddings, achieves a mean training loss of 1.96, a validation loss of 2.18, and an average training time of 4297.79 seconds per epoch. Additionally, we evaluated the proposed architecture and the baseline on 14 diverse translation datasets from TensorFlow. The results indicate that our method consistently achieves lower or comparable training and validation losses, suggesting enhanced learning efficiency.
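As a quick check on the scale of these gains, the relative improvements can be computed directly from the figures reported above. The dictionary keys are illustrative names only; the numbers are those in the abstract.

```python
# Mean results reported in the abstract for the Portuguese-English task.
baseline = {"train_loss": 1.96, "val_loss": 2.18, "sec_per_epoch": 4297.79}
proposed = {"train_loss": 1.21, "val_loss": 1.51, "sec_per_epoch": 1352.27}

# Per-epoch speedup: how many times faster the proposed model trains.
speedup = baseline["sec_per_epoch"] / proposed["sec_per_epoch"]

# Relative loss reductions with respect to the baseline, in percent.
train_drop = 100 * (baseline["train_loss"] - proposed["train_loss"]) / baseline["train_loss"]
val_drop = 100 * (baseline["val_loss"] - proposed["val_loss"]) / baseline["val_loss"]

print(f"speedup per epoch: {speedup:.2f}x")          # about 3.18x
print(f"training loss reduction: {train_drop:.1f}%")  # about 38.3%
print(f"validation loss reduction: {val_drop:.1f}%")  # about 30.7%
```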
