Efficient Transformers with Dynamic Token Pooling

Piotr Nawrot; Jan Chorowski; Adrian Łańcucki; Edoardo M. Ponti

TL;DR

This work tackles the inefficiency of Transformers by introducing a dynamic-pooling mechanism that learns variable-length token segments in the intermediate layers while preserving autoregressive generation. A boundary predictor is trained jointly with the language model, with training signals ranging from end-to-end learning via Gumbel-Sigmoid to supervision from Unigram tokenization, entropy spikes, or whitespace cues. Empirical results on English and morphologically diverse languages show that dynamic pooling trains faster and achieves lower bits per character (BPC) than vanilla Transformers and fixed-pooling Hourglass models, with whitespace and Unigram supervision performing best. The approach scales well, reduces memory and time by substantial factors at higher shortening rates, and remains competitive or superior as model depth increases, suggesting a promising path for scalable, efficient language modelling. Limitations include the language dependence of boundaries (e.g., Finnish’s morphology), the restriction to contiguous segments, and boundary decisions that are not yet tightly coupled, which leaves potential gains unrealised.

Abstract

Transformers achieve unrivalled performance in modelling language, but remain inefficient in terms of memory and time complexity. A possible remedy is to reduce the sequence length in the intermediate layers by pooling fixed-length segments of tokens. Nevertheless, natural units of meaning, such as words or phrases, display varying sizes. To address this mismatch, we equip language models with a dynamic-pooling mechanism, which predicts segment boundaries in an autoregressive fashion. We compare several methods to infer boundaries, including end-to-end learning through stochastic re-parameterisation, supervised learning (based on segmentations from subword tokenizers or spikes in conditional entropy), as well as linguistically motivated boundaries. We perform character-level evaluation on texts from multiple datasets and morphologically diverse languages. The results demonstrate that dynamic pooling, which jointly segments and models language, is both faster and more accurate than vanilla Transformers and fixed-length pooling within the same computational budget.
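The end-to-end option mentioned above (stochastic re-parameterisation, named Gumbel-Sigmoid in the TL;DR) can be illustrated with a minimal NumPy sketch. It uses the standard relaxed-Bernoulli formulation, which is an assumption here since the summary does not spell out the paper's exact equation; the function name `gumbel_sigmoid`, the default temperature, and the clipping constants are all hypothetical choices made for illustration.

```python
import numpy as np

def gumbel_sigmoid(p, temperature=0.5, rng=None):
    """Draw relaxed (soft) boundary samples in [0, 1] from probabilities p.

    Assumption: standard relaxed-Bernoulli sampling, not necessarily the
    exact form used in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Clip to keep the logarithms finite.
    p = np.clip(np.asarray(p, dtype=np.float64), 1e-6, 1.0 - 1e-6)
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=p.shape)
    logits = np.log(p) - np.log1p(-p)      # log-odds of a boundary
    noise = np.log(u) - np.log1p(-u)       # logistic noise
    return 1.0 / (1.0 + np.exp(-(logits + noise) / temperature))

# Usage: soft samples for five positions, thresholded into hard boundaries.
probs = np.array([0.05, 0.9, 0.1, 0.2, 0.95])
soft = gumbel_sigmoid(probs, rng=np.random.default_rng(0))
hard = (soft > 0.5).astype(np.int64)
```

Because the noise is logistic, a soft sample exceeds 0.5 with probability `p`, so thresholding recovers Bernoulli samples, while the soft values stay differentiable with respect to `p` in a framework with automatic differentiation.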

Paper Structure (37 sections, 7 equations, 5 figures, 3 tables)

This paper contains 37 sections, 7 equations, 5 figures, 3 tables.

Introduction
Background
Language Modelling with Transformers
Hourglass Transformer
Dynamic-Pooling Transformer
Boundary Prediction
Segmenting with Gumbel-Sigmoid
Segmenting with Subword Tokenizers
Segmenting with Entropy Spikes
Linguistically Inspired Segments
Pooling and Up-sampling
Auxiliary Objectives
Experimental Setup
Datasets
Models
...and 22 more sections

Figures (5)

Figure 1: The architecture of a dynamic-pooling Transformer, which jointly performs language modelling and token segmentation. The boundary predictor predicts segment boundaries and pools together groups of variable length by averaging. The shortened sequence is processed efficiently by a series of intermediate layers, then up-sampled back to the original length via duplication. The model generates the next token $\pmb{x}_t$ in the same resolution as the input.
Figure 2: Entropy of a Transformer character-level language model in two text segments. Red vertical lines indicate the boundaries according to spikes in conditional entropy. Most of them coincide with whitespaces, due to the high uncertainty at word starts, but they also fall after morphemes like 'great' or 'measure'. Segmentation may vary based on the context, e.g., of the word 'performance'.
Figure 3: Test BPC ($\downarrow$) and shortening factor (SF; $\uparrow$). The higher the SF, the more efficient the model is (cf. Figure 5 in the Appendix). SF increases with higher vocabulary size (Unigram) or smaller prior boundary probability (Gumbel). Dynamic pooling methods shift the Pareto front, i.e., increase performance for the same efficiency (and vice versa). Note that fixed-pooling at $k = 1$ corresponds to the vanilla Transformer model.
Figure 4: Test BPC on text8 plotted against the number of Transformer layers for different shortening methods. We use two layers each in the first and last Transformer blocks and only scale the middle, downsampled block. Model size ranges from 28M parameters with 8 layers to 69M parameters with 20 layers. For all variants we observe performance gains with dynamic pooling.
Figure 5: Memory consumption and duration of a training step for different shortening factors on English text8. These results apply to both dynamic pooling and fixed pooling Hourglass models, as well as vanilla Transformers (for SF=1).
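
The pooling and up-sampling steps described in the Figure 1 caption (averaging variable-length groups, then duplicating back to the original length) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the convention that `boundaries[t] == 1` marks the last token of a segment, and the helper names `segment_ids`, `mean_pool`, `upsample`, and `shortening_factor`, are hypothetical. The sketch also does not cover how the full model keeps the up-sampled states causal for autoregressive generation.

```python
import numpy as np

def segment_ids(boundaries):
    """Map each token to a segment index.

    Assumed convention: boundaries[t] == 1 means token t closes its segment.
    """
    b = np.asarray(boundaries, dtype=np.int64)
    # Exclusive cumulative sum: a boundary token stays in the segment it closes.
    return np.cumsum(b) - b

def mean_pool(hidden, boundaries):
    """Average hidden states of shape (T, D) within each segment."""
    hidden = np.asarray(hidden, dtype=np.float64)
    ids = segment_ids(boundaries)
    n_segments = int(ids.max()) + 1
    sums = np.zeros((n_segments, hidden.shape[1]))
    np.add.at(sums, ids, hidden)           # scatter-add rows into their segment
    counts = np.bincount(ids, minlength=n_segments).reshape(-1, 1)
    return sums / counts, ids

def upsample(pooled, ids):
    """Restore the original length by duplicating each segment vector."""
    return pooled[ids]

def shortening_factor(boundaries):
    """Ratio of the original length to the number of segments."""
    ids = segment_ids(boundaries)
    return len(ids) / (int(ids.max()) + 1)

# Usage: five tokens split into segments of length 2 and 3.
hidden = np.array([[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0], [9.0, 9.0]])
boundaries = [0, 1, 0, 0, 1]
pooled, ids = mean_pool(hidden, boundaries)
restored = upsample(pooled, ids)
```

In this toy input the intermediate layers would process 2 vectors instead of 5, which is the kind of shortening that the SF axis in Figure 3 and Figure 5 measures.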
