
LeaPformer: Enabling Linear Transformers for Autoregressive and Simultaneous Tasks via Learned Proportions

Victor Agostinelli, Sanghyun Hong, Lizhong Chen

TL;DR

The paper addresses the inefficiency of standard attention in transformers by introducing Learned Proportions (LeaP) and LeaPformers, which replace fixed positional cues with learnable, proportion-based re-weighting so that linear attention can operate without known sequence lengths. A compact LeaP module derives token proportions $P_q$ and $P_k$ that modulate attention scores via a generalized form $S(Q_{h,i}, K_{h,j}^T)=S_q(Q_{h,i})S_k(K_{h,j}^T)\sigma(P_{q,i},P_{k,j})$, with cosFormer recovered as a special case. Empirically, LeaPformers achieve the best quality-throughput trade-off on the Long-Range Arena, show competitive autoregressive language modeling results on Wikitext-103, and deliver strong, near-linear performance in SimulST across two language pairs, illustrating the method's versatility for length-unknown scenarios. The work demonstrates that dynamic, learned attention concentration patterns can outperform static re-weighting, offering practical efficiency gains for a wide range of sequence modeling tasks.
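To make the generalized form above concrete, the following is a minimal sketch, not the authors' implementation. It assumes ReLU for both $S_q$ and $S_k$ and the cosine re-weighting $\sigma(P_q, P_k) = \cos(\frac{\pi}{2}(P_q - P_k))$, which is the cosFormer-style choice as understood here. The function names, toy inputs, and the `EPS` constant are illustrative assumptions. Because $\cos(a-b) = \cos a \cos b + \sin a \sin b$, the re-weighted scores factor into per-query and per-key terms, so the keys and values can be summarized once instead of forming the full score matrix. The sketch checks that this linear form agrees with the quadratic reference for both static proportions and arbitrary stand-ins for learned ones.

```python
import math

EPS = 1e-9  # illustrative constant guarding against a zero normalizer


def relu(x):
    return [max(0.0, v) for v in x]


def sigma(p_q, p_k):
    # Assumed cosine re-weighting on proportions in [0, 1].
    return math.cos(0.5 * math.pi * (p_q - p_k))


def quadratic_attention(Q, K, V, P_q, P_k):
    """Reference: materialise every score S_q(Q_i) . S_k(K_j) * sigma(P_q_i, P_k_j)."""
    out = []
    for q, pq in zip(Q, P_q):
        fq = relu(q)
        scores = [
            sum(a * b for a, b in zip(fq, relu(k))) * sigma(pq, pk)
            for k, pk in zip(K, P_k)
        ]
        norm = sum(scores) + EPS
        out.append([sum(s * v[c] for s, v in zip(scores, V)) / norm
                    for c in range(len(V[0]))])
    return out


def linear_attention(Q, K, V, P_q, P_k):
    """Same result without the N x N score matrix, via cos(a-b) = cos a cos b + sin a sin b."""
    d, d_v = len(K[0]), len(V[0])
    kv_cos = [[0.0] * d_v for _ in range(d)]
    kv_sin = [[0.0] * d_v for _ in range(d)]
    z_cos, z_sin = [0.0] * d, [0.0] * d
    # One pass over keys/values builds fixed-size summaries.
    for k, v, pk in zip(K, V, P_k):
        fk = relu(k)
        c, s = math.cos(0.5 * math.pi * pk), math.sin(0.5 * math.pi * pk)
        for a in range(d):
            z_cos[a] += fk[a] * c
            z_sin[a] += fk[a] * s
            for b in range(d_v):
                kv_cos[a][b] += fk[a] * c * v[b]
                kv_sin[a][b] += fk[a] * s * v[b]
    out = []
    for q, pq in zip(Q, P_q):
        fq = relu(q)
        c, s = math.cos(0.5 * math.pi * pq), math.sin(0.5 * math.pi * pq)
        norm = sum(fq[a] * (c * z_cos[a] + s * z_sin[a]) for a in range(d)) + EPS
        out.append([
            sum(fq[a] * (c * kv_cos[a][b] + s * kv_sin[a][b]) for a in range(d)) / norm
            for b in range(d_v)
        ])
    return out


def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))


# Toy inputs (made up for illustration): 4 tokens, model dim 3, value dim 2.
Q = [[0.2, -0.1, 0.5], [0.7, 0.3, -0.4], [0.1, 0.9, 0.2], [-0.3, 0.4, 0.6]]
K = [[0.5, 0.1, -0.2], [-0.6, 0.8, 0.3], [0.4, 0.4, 0.1], [0.2, -0.5, 0.9]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]
N = len(Q)

# Static proportions: position divided by a known sequence length.
static_p = [(i + 1) / N for i in range(N)]
# Stand-ins for learned proportions: any values in [0, 1], no length required.
learned_q = [0.10, 0.85, 0.40, 0.55]
learned_k = [0.30, 0.05, 0.95, 0.60]

diff_static = max_abs_diff(quadratic_attention(Q, K, V, static_p, static_p),
                           linear_attention(Q, K, V, static_p, static_p))
diff_learned = max_abs_diff(quadratic_attention(Q, K, V, learned_q, learned_k),
                            linear_attention(Q, K, V, learned_q, learned_k))
```

Note that the static proportions need the sequence length `N` up front, while the stand-in learned proportions do not. That difference is what the summary points to as the obstacle for autoregressive and simultaneous tasks.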

Abstract

A promising approach to preserving model performance in linearized transformers is to employ position-based re-weighting functions. However, state-of-the-art re-weighting functions rely heavily on target sequence lengths, making it difficult or impossible to apply them to autoregressive and simultaneous tasks, where the target and sometimes even the input sequence length are unknown. To address this issue, we propose Learned Proportions (LeaP) and LeaPformers. Our contribution is built on two major components. First, we generalize the dependence on explicit positional representations and sequence lengths into dependence on sequence proportions for re-weighting. Second, we replace static positional representations with dynamic proportions derived via a compact module, enabling more flexible attention concentration patterns. We evaluate LeaPformer against eight representative efficient transformers on the Long-Range Arena benchmark, showing that LeaPformer achieves the best quality-throughput trade-off. We also apply LeaPformer to Wikitext-103 autoregressive language modeling and simultaneous speech-to-text translation for two language pairs, achieving competitive results.

Paper Structure (40 sections, 13 equations, 4 figures, 10 tables)

Figures (4)

  • Figure 1: Contrasting accuracy-throughput trade-offs between LeaPformers and baselines. Throughput for 4k sequence length tasks (x-axis) vs. average accuracy score (y-axis) across the five tasks in the Long-Range Arena benchmark. The memory footprint of each mechanism is specified in labels and represented by circle size. LeaPformers provide the best average inference accuracy-throughput trade-off while achieving the second-best overall score with only up to a 1.5% increase to parameter count.
  • Figure 2: Illustration of the proposed Learned Proportions (LeaP) augmentation to linear transformer attention mechanisms. The LeaP module takes each token of the query and key matrices and reduces their dimensions to a single proportion.
  • Figure 3: An example of re-weighting matrices across all query (y-axis) and key (x-axis) token positions for baseline cosFormer (left) and LeaPformer (right) on list-operations in the Long-Range Arena benchmark. In this example, LeaPformer has clearly learned to attenuate more dynamically as opposed to the locality-focused, diagonalized re-weighting matrix of cosFormer.
  • Figure 4: Re-weighting heatmaps of decoder cross-attention during various alternative strategies to LeaP: stepping sequence length (left), max sequence length (middle), stepping the max sequence length intermittently (right).
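
The captions for Figures 2 and 3 describe a module that reduces each query or key token to a single proportion, and the re-weighting matrix that results. The sketch below is a rough illustration under stated assumptions, not the paper's architecture: `leap_proportions` is a hypothetical two-layer network (ReLU hidden layer, sigmoid output) with random, untrained weights, the same weights are reused for queries and keys for brevity, and the cosine form of the re-weighting function is assumed. It builds the static matrix from position-over-length proportions and a dynamic one from token-derived proportions.

```python
import math
import random


def leap_proportions(tokens, w_hidden, w_out):
    """Map each token vector to one proportion via a sigmoid output.

    Hypothetical stand-in for the LeaP module; the real layout may differ.
    """
    props = []
    for x in tokens:
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
        logit = sum(w * h for w, h in zip(w_out, hidden))
        props.append(1.0 / (1.0 + math.exp(-logit)))
    return props


def reweighting_matrix(p_q, p_k):
    # Rows index query tokens, columns index key tokens (as in Figure 3).
    return [[math.cos(0.5 * math.pi * (a - b)) for b in p_k] for a in p_q]


rng = random.Random(0)  # fixed seed so the toy example is repeatable
n_tokens, d_model, d_hidden = 6, 8, 4
queries = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
keys = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
w_hidden = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_out = [rng.gauss(0, 1) for _ in range(d_hidden)]

# Static case: proportions are position / length, so the matrix peaks on the diagonal.
static_p = [(i + 1) / n_tokens for i in range(n_tokens)]
static_matrix = reweighting_matrix(static_p, static_p)

# Dynamic case: proportions depend on token content, not on position or length.
p_q = leap_proportions(queries, w_hidden, w_out)
p_k = leap_proportions(keys, w_hidden, w_out)
learned_matrix = reweighting_matrix(p_q, p_k)
```

With untrained weights the dynamic matrix carries no meaning. The point is only that its structure is no longer tied to the diagonal, and that computing it never requires the sequence length.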