Rewiring the Transformer with Depth-Wise LSTMs

Hongfei Xu, Yang Song, Qiuhui Liu, Josef van Genabith, Deyi Xiong

TL;DR

This work replaces Transformer residual connections with depth-wise LSTMs that connect stacked layers, enabling selective cross-layer fusion and absorbing the layer normalization and feed-forward network (FFN) computations into the LSTM. The approach yields significant BLEU improvements on the WMT 14 English-German and English-French tasks and on the OPUS-100 multilingual NMT benchmark, and it also helps deeper Transformer stacks converge. Across extensive ablations, depth-wise LSTMs show favorable parameter efficiency and decoding speed compared to deeper vanilla Transformers. The results suggest depth-wise LSTMs as a practical mechanism for improving cross-layer information integration in deep Transformer architectures, especially in high-capacity multilingual settings.

Abstract

Stacking non-linear layers allows deep neural networks to model complicated functions, and including residual connections in Transformer layers is beneficial for convergence and performance. However, residual connections may make the model "forget" distant layers and fail to fuse information from previous layers effectively. Selectively managing the representation aggregation of Transformer layers may lead to better performance. In this paper, we present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers. We show that layer normalization and feed-forward computation within a Transformer layer can be absorbed into depth-wise LSTMs connecting pure Transformer attention layers. Our experiments with the 6-layer Transformer show significant BLEU improvements in both WMT 14 English-German / French tasks and the OPUS-100 many-to-many multilingual NMT task, and our deep Transformer experiments demonstrate the effectiveness of depth-wise LSTM on the convergence and performance of deep Transformers.
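
To make the rewiring idea concrete, here is a minimal NumPy sketch of how an LSTM cell applied along the depth axis could take the place of the residual update `h = h + x`. It is an illustration under stated assumptions, not the paper's exact formulation. The class `DepthWiseLSTMCell`, the helpers `toy_attention` and `encode`, the standard LSTM gate equations, the position of layer normalization, and the choice of separate parameters per layer are all hypothetical. The paper's absorbed feed-forward computation is approximated here by a single linear projection.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def layer_norm(z, eps=1e-6):
    # Normalize each position's feature vector (no learned gain/bias in this sketch).
    mean = z.mean(axis=-1, keepdims=True)
    std = z.std(axis=-1, keepdims=True)
    return (z - mean) / (std + eps)


class DepthWiseLSTMCell:
    """Hypothetical cell: fuses a layer's attention output with the state from the layer below."""

    def __init__(self, d_model, rng):
        scale = 1.0 / np.sqrt(2 * d_model)
        # One projection produces the input, forget, and output gates plus the candidate.
        self.W = rng.normal(0.0, scale, size=(2 * d_model, 4 * d_model))
        self.b = np.zeros(4 * d_model)

    def __call__(self, x, h_prev, c_prev):
        # Assumption: layer normalization is applied to the concatenated input.
        z = layer_norm(np.concatenate([x, h_prev], axis=-1)) @ self.W + self.b
        i, f, o, g = np.split(z, 4, axis=-1)
        # Standard LSTM update, with "time" replaced by layer depth.
        c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c


def toy_attention(h):
    # Stand-in for a real multi-head self-attention sub-layer (single head, no parameters).
    scores = h @ h.T / np.sqrt(h.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ h


def encode(embeddings, cells):
    h = embeddings
    c = np.zeros_like(embeddings)  # the cell state carries information across layers
    for cell in cells:
        x = toy_attention(h)
        h, c = cell(x, h, c)  # replaces the residual update h = h + x
    return h


rng = np.random.default_rng(0)
seq_len, d_model, n_layers = 5, 8, 6
cells = [DepthWiseLSTMCell(d_model, rng) for _ in range(n_layers)]
out = encode(rng.normal(size=(seq_len, d_model)), cells)
print(out.shape)  # (5, 8)
```

In this sketch the forget gate decides how much of the accumulated lower-layer state survives at each layer, which is the selective aggregation the abstract contrasts with plain residual connections.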

Paper Structure

This paper contains 19 sections, 8 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Depth-wise LSTM computation.
  • Figure 2: Encoder layer with depth-wise LSTM.
  • Figure 3: Decoder layer with depth-wise LSTM.