DeLighT: Deep and Light-weight Transformer

Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, Hannaneh Hajishirzi

TL;DR

DeLighT introduces a deep, light-weight Transformer that reallocates parameters both within blocks, via the DeLighT transformation built on group linear transformations, and across blocks, via block-wise scaling. This decouples depth and width from input size, enabling networks that are 2.5 to 4 times deeper than standard Transformers yet have fewer parameters and MACs, while matching or beating Transformer baselines on machine translation and language modeling. Ablations and efficiency analyses show benefits from feature shuffling, input-mixer connections, and a light-weight FFN. The results suggest the approach is practical for parameter-efficient sequence modeling and may apply beyond the reported tasks.

Abstract

We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters. DeLighT more efficiently allocates parameters both (1) within each Transformer block using the DeLighT transformation, a deep and light-weight transformation, and (2) across blocks using block-wise scaling, which allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models and yet have fewer parameters and operations. Experiments on benchmark machine translation and language modeling tasks show that DeLighT matches or improves the performance of baseline Transformers with 2 to 3 times fewer parameters on average. Our source code is available at: https://github.com/sacmehta/delight
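To make the parameter savings of group linear transformations (GLTs) concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that a GLT splits the input and output features evenly into `groups` independent linear maps, and that feature shuffling interleaves features across groups in the usual reshape-and-transpose manner. The function names `linear_params`, `glt_params`, and `feature_shuffle` are hypothetical and chosen only for illustration.

```python
def linear_params(d_in, d_out):
    """Weight count of a dense linear layer (bias ignored)."""
    return d_in * d_out


def glt_params(d_in, d_out, groups):
    """Weight count of a group linear transformation (bias ignored).

    Assumption: features are split evenly, so each of the `groups`
    sub-layers maps d_in/groups inputs to d_out/groups outputs.
    """
    if d_in % groups or d_out % groups:
        raise ValueError("d_in and d_out must be divisible by groups")
    return groups * (d_in // groups) * (d_out // groups)


def feature_shuffle(features, groups):
    """Interleave features across groups.

    Equivalent to viewing the list as a (groups, per_group) grid,
    transposing it, and flattening the result (an assumed shuffle scheme).
    """
    if len(features) % groups:
        raise ValueError("feature count must be divisible by groups")
    per_group = len(features) // groups
    return [
        features[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


if __name__ == "__main__":
    # A GLT with g groups uses g times fewer weights than a dense layer.
    print(linear_params(512, 512), glt_params(512, 512, 4))
    # After shuffling, neighbouring positions come from different groups.
    print(feature_shuffle([0, 1, 2, 3, 4, 5], 2))
```

Under these assumptions a GLT with g groups needs d_in * d_out / g weights, which is why more groups allow wider representations at the same parameter budget; the shuffle lets the next GLT see features from every group.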

Paper Structure

This paper contains 20 sections, 5 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: (a, b) Block-wise comparison between the standard transformer block of Vaswani et al. (2017) and the DeLighT block. In the DeLighT block, the number of operations in computing attention is reduced by half, while the number of parameters (and operations) in the FFN is reduced by $16\times$. Transformations with learnable parameters (Linear and DeLighT) are shown in color. The shape of each linear transformation indicates its operation (expansion, reduction, etc.). (c, d) Comparison of the DeFINE transformation (Mehta et al., 2020) with the DeLighT transformation. Compared to the DeFINE transformation, the DeLighT transformation uses group linear transformations (GLTs) with more groups to learn wider representations with fewer parameters. Different colors are used to show groups in GLTs. For simplicity, feature shuffling is not shown in (d).
  • Figure 2: Example illustrating the expansion phase in the DeLighT transformation that uses GLTs, feature shuffling, and an input mixer connection, to learn deeper and wider representations efficiently. For illustrative purposes, we have used the same input and output dimensions.
  • Figure 2: DeLighT networks are deep, light-weight, and efficient compared to transformers. BLEU score is reported on the WMT'14 En-Fr dataset. To compute network depth, we count the number of sequential layers in the network (Section \ref{ssec:layer_wise_scaling}). We used 20 source and 20 target tokens for computing multiplication-addition operations (MACs). See Appendix \ref{sec:appendix_mac} for details.
  • Figure 3: Block-wise scaling efficiently allocates parameters and operations across blocks, leading to shallower and narrower DeLighT blocks near the input and deeper and wider DeLighT blocks near the output. In (b), DeLighT networks with both uniform ($N = N_{min} = N_{max} = 8$) and block-wise ($N_{min} = 4$, $N_{max} = 8$) scaling have about 16.7 M parameters and perform 3.5 B operations (computed for a sequence length of $n = 30$); however, the DeLighT network with block-wise scaling delivers 2 points better perplexity.
  • Figure 4: Comparison of DeLighT with Transformers and Evolved Transformers at two different settings, on the WMT'14 En-De corpus: (1) the number of parameters is the same and (2) the performance is the same.
  • ...and 8 more figures
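
The block-wise scaling described in the Figure 3 caption can be illustrated with a small sketch. The exact scaling rule is not given in the text above, so this example assumes that the per-block depth grows linearly from $N_{min}$ at the first block to $N_{max}$ at the last and is rounded to the nearest integer. The function name `blockwise_depths` and the block counts used below are hypothetical.

```python
def blockwise_depths(n_min, n_max, num_blocks):
    """Per-block depth under an assumed linear block-wise scaling rule.

    Block b (0-indexed) gets n_min + (n_max - n_min) * b / (num_blocks - 1),
    rounded to the nearest integer. Both the linear interpolation and the
    rounding are assumptions made for illustration.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if n_min > n_max:
        raise ValueError("n_min must not exceed n_max")
    if num_blocks == 1:
        # Nothing to interpolate over; fall back to the minimum depth.
        return [n_min]
    step = (n_max - n_min) / (num_blocks - 1)
    return [int(round(n_min + step * b)) for b in range(num_blocks)]


if __name__ == "__main__":
    # Uniform scaling: every block has the same depth.
    print(blockwise_depths(8, 8, 5))
    # Block-wise scaling: shallow near the input, deep near the output.
    print(blockwise_depths(4, 8, 5))
```

With $N_{min} = N_{max}$ the rule reduces to uniform scaling, which corresponds to the uniform configuration in the caption; with $N_{min} < N_{max}$, depth shifts toward the blocks near the output.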