Efficient transformer with reinforced position embedding for language models

Yen-Che Hsiao, Abhishek Dutta

TL;DR

The paper addresses efficient language modeling by proposing a reinforced positional embedding strategy integrated into an encoder–decoder Transformer. It introduces three modifications: column-wise normalization of token embeddings, concatenation of token and positional embeddings before the first encoder and decoder blocks, and use of the normalized token embeddings as the attention values. Empirical results on Portuguese–English translation and 14 translation datasets from Ye2018WordEmbeddings show that the approach achieves lower training and validation losses, with roughly 2–4x reductions in training time and about a threefold decrease in parameters compared to a deeper baseline; one dataset (Belarusian–English) is the notable exception. This work suggests a practical route to more parameter- and time-efficient Transformers for translation tasks.

Abstract

In this paper, we propose an efficient transformer architecture that uses reinforced positional embedding to obtain superior performance with half the number of encoder-decoder layers. We demonstrate that concatenating positional encoding with trainable token embeddings, normalizing columns in the token embedding matrix, and using the normalized token embedding matrix as the value of the attention layer improve the training loss, the validation loss, and the training time of an encoder-decoder Transformer model on a Portuguese-English translation task with 10 epochs or 12 hours of training across 10 trials. Our method, with roughly a threefold parameter reduction compared to the baseline model, yields a mean training loss of 1.21, a mean validation loss of 1.51, and an average training time of 1352.27 seconds per epoch. It surpasses the baseline model with the same embedding dimension, which adds the positional encoding to the token embeddings and achieves a mean training loss of 1.96, a mean validation loss of 2.18, and an average training time of 4297.79 seconds per epoch. Additionally, we evaluated our proposed architecture and the baseline across 14 diverse translation datasets from TensorFlow. The results indicate that our method consistently achieves lower or comparable training and validation losses, suggesting enhanced learning efficiency.
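
To put the figures reported in the abstract side by side, the short calculation below derives the per-epoch speedup and the relative loss reductions. It is only arithmetic on the numbers quoted above; the dictionary layout and variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract (Portuguese-English task, means over trials).
baseline = {"train_loss": 1.96, "val_loss": 2.18, "sec_per_epoch": 4297.79}
proposed = {"train_loss": 1.21, "val_loss": 1.51, "sec_per_epoch": 1352.27}

# How many times faster one epoch of the proposed model is.
speedup = baseline["sec_per_epoch"] / proposed["sec_per_epoch"]

# Relative reduction in the mean losses.
train_drop = 1 - proposed["train_loss"] / baseline["train_loss"]
val_drop = 1 - proposed["val_loss"] / baseline["val_loss"]

print(f"speedup per epoch: {speedup:.2f}x")          # 3.18x
print(f"training loss reduction: {train_drop:.1%}")  # 38.3%
print(f"validation loss reduction: {val_drop:.1%}")  # 30.7%
```

The roughly 3.2x per-epoch speedup falls inside the 2–4x range mentioned in the TL;DR.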

Paper Structure (12 sections, 35 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: (a) The matrix on the left is the token embedding matrix. It has one row per token and one column per token feature, so each row is the feature vector of the token at that row. The matrix on the right is the positional embedding matrix. It has one row per token and one column per positional feature, so each row is the feature vector corresponding to that position. (b) Each column of the token embedding matrix is normalized so that its elements have zero mean and unit variance. (c) After this normalization, the positional embedding matrix is concatenated to the right of the normalized token embedding matrix. (d) For the scaled dot-product attention in each attention layer, the value is the normalized token embedding matrix from the input. Created with BioRender.com.
  • Figure 2: The proposed modified transformer architecture. We made three modifications to the transformer architecture in vaswani2017attention. First, each column of the token embedding matrix is normalized to have zero mean and unit variance, for both the encoder and the decoder. Second, the token embedding matrix and the positional embedding matrix are concatenated before the first encoder block and the first decoder block. Lastly, each attention layer takes as its value the normalized token embedding matrix without the concatenated positional embedding. Created with BioRender.com.
  • Figure 3: (a) The transparent blue dashed lines and the transparent blue solid lines are the training loss and the validation loss, respectively, of the baseline transformer model on the Portuguese to English translation dataset from Ye2018WordEmbeddings. The transparent brown dashed lines and the transparent brown solid lines are the training loss and the validation loss, respectively, of the proposed transformer model on the same dataset. Each line represents the training or validation loss for one trial. Each model is trained for 10 epochs or 12 hours. For the baseline model, 4 trials finished 10 epochs of training, 4 trials finished 9 epochs, and 2 trials finished 8 epochs. The proposed model finished 10 epochs of training in all trials. Across the 10 trials, the mean training losses of the baseline are roughly 6.60, 4.55, 3.82, 3.29, 2.89, 2.57, 2.30, 2.11, 1.96, and 1.84, and the mean validation losses of the baseline are roughly 5.04, 4.06, 3.46, 3.01, 2.77, 2.49, 2.35, 2.25, and 2.18. Across the 10 trials, the mean training losses of the proposed model are roughly 6.68, 4.55, 3.62, 2.89, 2.36, 1.97, 1.67, 1.46, 1.32, and 1.21, and the mean validation losses of the proposed model are roughly 5.08, 3.93, 3.10, 2.51, 2.17, 1.88, 1.72, 1.63, 1.56, and 1.51. The proposed model shows a lower mean training loss after 3 epochs and a lower mean validation loss after 2 epochs. (b) The bar plot shows the average training time per epoch for the baseline and the proposed model on the Portuguese to English translation dataset from Ye2018WordEmbeddings. The average training time per epoch for the baseline is roughly 4297.79 seconds, which is higher than the roughly 1352.27 seconds for the proposed model. In addition, the variance of the training time for the baseline is roughly 675.79 seconds, which is higher than the roughly 144.50 seconds for the proposed model.
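
The three steps described in Figures 1 and 2 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single attention head, sinusoidal positional encodings, random stand-ins for the trainable weights, no masking, and it passes the normalized token embeddings to the attention as values without any projection. All function names, dimensions, and the `eps` constant are hypothetical.

```python
import numpy as np

def normalize_columns(token_emb, eps=1e-6):
    """Scale each column (feature) to zero mean and unit variance."""
    mean = token_emb.mean(axis=0, keepdims=True)
    std = token_emb.std(axis=0, keepdims=True)
    return (token_emb - mean) / (std + eps)  # eps is an assumed safeguard

def sinusoidal_positions(num_tokens, dim):
    """Sinusoidal positional encoding (an assumed choice for this sketch)."""
    pos = np.arange(num_tokens)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_token_values(x_concat, values, w_q, w_k):
    """Scaled dot-product attention: queries and keys come from the
    concatenated input, values are the normalized token embeddings."""
    q = x_concat @ w_q
    k = x_concat @ w_k
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ values

rng = np.random.default_rng(0)
num_tokens, d_tok, d_pos, d_k = 5, 8, 8, 16  # illustrative sizes

# (a) Token and positional embedding matrices, one row per token.
token_emb = rng.normal(loc=2.0, scale=3.0, size=(num_tokens, d_tok))
pos_emb = sinusoidal_positions(num_tokens, d_pos)

# (b) Column-wise normalization of the token embeddings.
tok_norm = normalize_columns(token_emb)

# (c) Concatenate positions to the right of the normalized tokens.
x_concat = np.concatenate([tok_norm, pos_emb], axis=1)

# (d) Attention whose value is the normalized token embedding matrix.
w_q = rng.normal(size=(d_tok + d_pos, d_k))
w_k = rng.normal(size=(d_tok + d_pos, d_k))
out = attention_with_token_values(x_concat, tok_norm, w_q, w_k)

print(x_concat.shape, out.shape)  # (5, 16) (5, 8)
```

Because the values exclude the positional columns, the attention output in this sketch keeps the token feature width (8) rather than the concatenated width (16). How the original model handles this width in later blocks is not stated in the captions above.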