IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs

Yuzhen Mao, Martin Ester, Ke Li

TL;DR

Self-attention in Transformers costs $O(mn)$ for $m$ queries and $n$ keys, which is quadratic in sequence length and hinders CPU deployment for long inputs. IceFormer accelerates inference without retraining by embedding keys and queries into a higher-dimensional space via $T_K$ and $T_Q$, then applying a fast $k$-NN search to identify the top-$k$ contributing keys without computing all attention weights. It yields speedups of $2.73\times$ to $7.63\times$ on long-context benchmarks while preserving $98.6\%$ to $99.6\%$ of the original accuracy, and speedups of up to $3.0\times$ on LLM prompts with minimal loss. By enabling efficient, general, and accurate CPU inference for long sequences, IceFormer facilitates practical deployment of decoder-only and encoder–decoder transformers without retraining.
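The embed-then-search idea can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not stated in this summary: the embeddings below use the standard reduction from maximum inner product search to Euclidean nearest-neighbour search (keys scaled by a bound `c` on their norms and padded to unit length, queries normalized and padded with a zero), which may or may not match the paper's exact definitions of $T_K$ and $T_Q$; the neighbour search is brute force, standing in for the fast $k$-NN index; and all function names are hypothetical.

```python
import numpy as np

def embed_keys(K):
    # Assumed form of T_K: scale keys by c = max key norm, then append one
    # coordinate so that every embedded key has unit Euclidean norm.
    norms = np.linalg.norm(K, axis=1)
    c = norms.max()
    extra = np.sqrt(np.maximum(0.0, 1.0 - (norms / c) ** 2))
    return np.hstack([K / c, extra[:, None]])

def embed_queries(Q):
    # Assumed form of T_Q: normalize each query (assumes no zero-norm query)
    # and append a zero coordinate.
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    return np.hstack([Qn, np.zeros((Q.shape[0], 1))])

def topk_keys(Q, K, k):
    # Brute-force k-NN in the embedded space (a stand-in for a fast index).
    # Squared distance is 2 - 2 * (q . key) / (|q| * c), so for a fixed query
    # the nearest embedded keys are the keys with the largest inner products.
    TQ, TK = embed_queries(Q), embed_keys(K)
    d2 = ((TQ[:, None, :] - TK[None, :, :]) ** 2).sum(axis=-1)
    return np.argpartition(d2, k - 1, axis=1)[:, :k]

def topk_sparse_attention(Q, K, V, k):
    # Softmax attention restricted to the k selected keys of each query.
    idx = topk_keys(Q, K, k)                                   # (m, k)
    scores = np.einsum("md,mkd->mk", Q, K[idx]) / np.sqrt(Q.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("mk,mkd->md", w, V[idx])

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 5))
out = topk_sparse_attention(Q, K, V, k=3)   # shape (4, 5)
```

Because the brute-force search still compares every query with every key, this sketch shows only the selection logic, not the speedup; any gain would come from replacing `topk_keys` with a sublinear nearest-neighbour index.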

Abstract

One limitation of existing Transformer-based models is that they cannot handle very long sequences as input, since their self-attention operations exhibit quadratic time and space complexity. This problem becomes especially acute when Transformers are deployed on hardware platforms equipped only with CPUs. To address this issue, we propose a novel method for accelerating self-attention at inference time that works with pretrained Transformer models out-of-the-box without requiring retraining. We use our method to accelerate various long-sequence Transformers, including a leading LLaMA 2-based LLM, on various benchmarks and demonstrate speedups of 2.73x to 7.63x while retaining 98.6% to 99.6% of the accuracy of the original pretrained models. The code is available on our project website at https://yuzhenmao.github.io/IceFormer/.

Paper Structure (35 sections, 9 equations, 9 figures, 7 tables, 3 algorithms)

Figures (9)

  • Figure 1: Comparison between the Transformer [Vaswani et al., 2017] (top row) and the proposed method, IceFormer (bottom row). We illustrate with one query and $k=2$ in $k$-NNS. In the two attention matrices presented, the top-2 largest attention weights in each row are represented by a dark color. The remaining attention weights are shown in a pale color in the vanilla attention matrix, and are set to zero (depicted in white) in the sparse attention matrix.
  • Figure 2: Difference between ranking-based and bucketing-based $k$-NNS. Left: illustration of two $k$-NNS methods, Prioritized DCI (ranking-based) and LSH (bucketing-based). Right: the number of keys whose projections are less than a threshold. Ranking-based algorithms return a fixed number of keys that are most similar to the query under projection (shown as a fixed-size row), which effectively filters out points outside a variable-sized window on the projections. Bucketing-based algorithms use a fixed-size window (shown as a fixed-size column) and return all keys whose projections lie within it.
  • Figure 3: Comparison between twelve $k$-NNS algorithms on the fashion-mnist-784 dataset. There are in total 60,000 keys and 10,000 queries with 784 dimensions. The task is to find the top-10 closest neighbours from the entire set of keys for every query. X-axis: average recall across all the queries; Y-axis: total latency (seconds), including database construction and querying.
  • Figure 4: Tradeoff between speed and accuracy as $k$ varies on five LRA tasks. The horizontal axis of each plot is the average wall-clock time of the attention module, and the vertical axis is the model prediction accuracy. Each point corresponds to a value of $k$ in the following set: {3, 5, 8, 10}.
  • Figure 5: Scalability analysis for IceFormer on the LongEval benchmark. The left figure shows the results of the topic retrieval task; the right figure shows the results of the line retrieval task. X-axis: length of the input prompt; Y-axis (Left): retrieval accuracy; Y-axis (Right): average processing wall-clock time (seconds) of the attention module.
  • ...and 4 more figures