Sparse Universal Transformer

Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan

TL;DR

Sparse Universal Transformer (SUT) combines Sparse Mixture of Experts and a stick-breaking dynamic halting mechanism to scale parameter-efficient Universal Transformers without proportional compute growth. The approach yields competitive WMT'14 En-De translation performance with roughly half the MACs of dense UTs, and improves compositional generalization on CFQ and logical inference benchmarks. A post-training halting strategy enables substantial inference-time computation reductions with limited accuracy loss. The results point to a practical, scalable way to share parameters across depth in Transformers, though limitations remain for larger-scale deployment.

Abstract

The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers. Empirical evidence shows that UTs have better compositional generalization than Vanilla Transformers (VTs) in formal language tasks. The parameter sharing also gives UTs better parameter efficiency than VTs. Despite these advantages, scaling UT parameters is much more compute- and memory-intensive than scaling up a VT. This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) and a new stick-breaking-based dynamic halting mechanism to reduce the UT's computational complexity while retaining its parameter efficiency and generalization ability. Experiments show that SUT matches the performance of strong baseline models on WMT'14 while using only half the computation and parameters, and achieves strong generalization results on formal language tasks (Logical Inference and CFQ). The new halting mechanism also enables around a 50% reduction in computation during inference with very little performance decrease on formal language tasks.
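
To make the Sparse Mixture of Experts idea concrete, the following is a minimal NumPy sketch of generic top-k expert routing. It is not the paper's implementation: the function name `smoe_layer`, the router weights `gate_w`, the use of a single linear map per expert, and the choice `k=2` are all assumptions made for illustration. The point it shows is that each token only pays for `k` experts, however many experts (and therefore parameters) the layer holds.

```python
import numpy as np

def smoe_layer(x, gate_w, expert_ws, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:         (tokens, d_model) input activations
    gate_w:    (d_model, n_experts) router weights (hypothetical name)
    expert_ws: (n_experts, d_model, d_model), one linear map per expert
               (an assumption; real experts are usually small MLPs)
    """
    logits = x @ gate_w                                # (tokens, n_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :k]      # k largest logits per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)

    # Softmax over the selected experts only, so unselected experts get weight 0.
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            e = top_idx[t, j]
            # Only k of the n_experts matrices are ever multiplied per token.
            out[t] += weights[t, j] * (x[t] @ expert_ws[e])
    return out, top_idx
```

With, say, 8 experts and `k=2`, the layer stores 8 expert matrices but each token performs only 2 expert matrix products, which is the sense in which parameters can grow without a proportional growth in compute.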

Paper Structure

This paper contains 25 sections, 5 equations, 9 figures, 9 tables, 1 algorithm.

Figures (9)

  • Figure 1: A VT has separate Transformer blocks for each layer, with different parameters. For a UT with the same number of parameters, the UT block will be $\sim$3 times the dimensions of each VT block. Running this block for 3 layers would then incur approximately 9 times the runtime memory. Using SMoEs can recover approximately the same computational cost as the VT.
  • Figure 2: Example of the compositional generalization splits from Shen et al. (2019). The operators "not" and "and" are never seen in succession during training, and a VT may learn a shortcut that prevents generalization at test time.
  • Figure 3: Left: Schematic of a SUT block. Right: While the input of each SUT block is the output of the previous layer, the attention mechanism attends to the halted state of the timestep. When the halting probability exceeds $\alpha_\text{thresh}$, the hidden state is simply copied. Finally, the halted state is used as the output of the SUT.
  • Figure 4: The average dynamic halting depth of the UT model as the number of operators increases in the test set. The model learns to think more when the problem is harder.
  • Figure 5: Above: Plot of $1 - \sum_{l'=1}^{l-1} \alpha^{(t)}_{l'}$ for an example Logical Inference input ($x$-axis: timesteps, $y$-axis: layers). This visualizes the halting pattern of the model: dark blue represents halted, while yellow represents active. Below: Efficiency vs. performance tradeoff curves when $\alpha_\text{thresh}$ is adjusted.
  • ...and 4 more figures
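
The quantity plotted in Figure 5, $1 - \sum_{l'=1}^{l-1} \alpha^{(t)}_{l'}$, can be illustrated with a small NumPy sketch of a standard stick-breaking construction. This is a sketch under stated assumptions rather than the paper's code: the per-layer probabilities `alpha_hat`, the helper names `stick_breaking` and `active_mask`, and the exact rule for applying $\alpha_\text{thresh}$ (a position counts as halted once its accumulated halting probability exceeds the threshold) are all hypothetical choices made for illustration.

```python
import numpy as np

def stick_breaking(alpha_hat):
    """Turn per-layer halting probabilities into a distribution over layers.

    alpha_hat: (layers, timesteps), each entry in [0, 1] (hypothetical input)
    Returns:
      alpha:     (layers, timesteps), probability of halting exactly at layer l
      remaining: (layers, timesteps), 1 - sum_{l' < l} alpha[l'], i.e. the
                 part of the "stick" still unbroken on entering layer l
    """
    not_halted = np.cumprod(1.0 - alpha_hat, axis=0)
    # On entering layer l, the unbroken part is prod_{l' < l} (1 - alpha_hat[l']).
    remaining = np.concatenate(
        [np.ones_like(alpha_hat[:1]), not_halted[:-1]], axis=0
    )
    alpha = alpha_hat * remaining
    return alpha, remaining

def active_mask(remaining, alpha_thresh):
    """True where a position is still computing (assumed thresholding rule)."""
    accumulated = 1.0 - remaining      # halting probability spent before layer l
    return accumulated <= alpha_thresh

# One timestep, three layers, each halting with conditional probability 0.5.
alpha_hat = np.array([[0.5], [0.5], [0.5]])
alpha, remaining = stick_breaking(alpha_hat)
mask = active_mask(remaining, alpha_thresh=0.6)
```

Here `remaining` is `[1.0, 0.5, 0.25]` and `alpha` is `[0.5, 0.25, 0.125]`, so with `alpha_thresh=0.6` the position is active for the first two layers and treated as halted at the third. Lowering the threshold halts positions earlier, which is the knob behind the efficiency vs. performance tradeoff described in the caption.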