IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs
Yuzhen Mao, Martin Ester, Ke Li
TL;DR
Self-attention in Transformers has time and space complexity $O(mn)$ for $m$ queries and $n$ keys, which is quadratic in sequence length and hinders CPU deployment for long inputs. IceFormer accelerates inference without retraining: it embeds keys and queries into a higher-dimensional space via $T_K$ and $T_Q$, then applies a fast $k$-NN search to identify the top-$k$ contributing keys for each query without computing all attention weights. It yields speedups of $2.73\times$ to $7.63\times$ on long-context benchmarks while preserving $98.6\%$--$99.6\%$ of accuracy, and speedups of up to $3.0\times$ on LLM prompts with minimal loss. By making CPU inference on long sequences efficient, general, and accurate, IceFormer facilitates practical deployment of decoder-only and encoder–decoder Transformers.
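The embed-then-search idea can be illustrated with a small sketch. The forms of $T_K$ and $T_Q$ below are an assumption, since this summary does not spell them out: they follow a standard reduction from maximum inner product search to nearest-neighbour search, $T_K(k) = [k/c,\ \sqrt{1 - \lVert k\rVert^2/c^2}]$ and $T_Q(q) = [q/\lVert q\rVert,\ 0]$ with $c \ge \max_j \lVert k_j\rVert$. Both embeddings have unit norm, so $\lVert T_Q(q) - T_K(k)\rVert^2 = 2 - 2\,q^\top k / (c\,\lVert q\rVert)$, and the nearest embedded keys are exactly the keys with the largest attention logits. A brute-force sort stands in for the fast $k$-NN search, and the function names are hypothetical.

```python
import math


def embed_key(key, c):
    """Assumed T_K: scale by c, then append a coordinate that makes the result unit-norm."""
    scaled = [x / c for x in key]
    slack = max(0.0, 1.0 - sum(x * x for x in scaled))  # clamp tiny negative round-off
    return scaled + [math.sqrt(slack)]


def embed_query(query):
    """Assumed T_Q: normalise the (non-zero) query and append a zero coordinate."""
    norm = math.sqrt(sum(x * x for x in query))
    return [x / norm for x in query] + [0.0]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def topk_attention(query, keys, values, k):
    """Attend over only the k keys nearest to the query in the embedded space.

    Returns (indices of the selected keys, attention output vector).
    """
    c = max(math.sqrt(sum(x * x for x in key)) for key in keys)
    tq = embed_query(query)

    # Brute-force stand-in for a fast k-NN index: rank keys by embedded distance.
    order = sorted(range(len(keys)),
                   key=lambda j: squared_distance(tq, embed_key(keys[j], c)))
    nearest = order[:k]

    # Softmax restricted to the selected keys, with the usual 1/sqrt(d) scaling.
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(a * b for a, b in zip(query, keys[j])) for j in nearest]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)

    output = [0.0] * len(values[0])
    for w, j in zip(weights, nearest):
        for d, v in enumerate(values[j]):
            output[d] += (w / total) * v
    return nearest, output


keys = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0], [-1.0, -1.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
nearest, output = topk_attention([1.0, 1.0], keys, values, k=2)
print(nearest, output)
```

In this toy case the inner products with the query are 1, 2, 4 and -2, so the two selected keys are those at indices 2 and 1. The brute-force ranking here still touches every key; any speedup depends on replacing it with a sublinear $k$-NN index.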
Abstract
One limitation of existing Transformer-based models is that they cannot handle very long sequences as input, since their self-attention operations exhibit quadratic time and space complexity. This problem becomes especially acute when Transformers are deployed on hardware platforms equipped only with CPUs. To address this issue, we propose a novel method for accelerating self-attention at inference time that works with pretrained Transformer models out-of-the-box without requiring retraining. We apply our method to various long-sequence Transformers, including a leading LLaMA 2-based LLM, on various benchmarks and demonstrate a speedup of 2.73x to 7.63x while retaining 98.6% to 99.6% of the accuracy of the original pretrained models. The code is available on our project website at https://yuzhenmao.github.io/IceFormer/.
