Towards Sampling Data Structures for Tensor Products in Turnstile Streams
Zhao Song, Shenghao Xie, Samson Zhou
TL;DR
This work addresses the computational bottleneck of attention in large-scale models by introducing attention samplers, which sample the most informative coordinates of the attention computation. It formalizes the attention sampler through a generalized distribution over coordinates, connects linear attention to a tensor-product formulation via $A = A_1 \otimes A_2$ and $x = \operatorname{vec}(X)$, and develops streaming algorithms with space and update-time guarantees. The paper proves a hardness result for softmax attention in the form of an $\Omega(n)$ space lower bound, and gives upper bounds for $\ell_2$-based samplers under several streaming update scenarios, including the tensor version of the problem with $A = A_1 \otimes A_2$. Together, these results provide a principled framework for efficient streaming attention and sparse-attention schemes, with potential applications to streaming LLMs and real-time inference. They also connect subquadratic attention methods to the existing literature on sketching and sampling, trading off accuracy, space, and update time.
Abstract
This paper addresses the computational challenges of large-scale attention-based models in artificial intelligence by applying importance sampling methods in the streaming setting. Inspired by the classical definition of the $\ell_2$ sampler and by recent progress on attention schemes in Large Language Models (LLMs), we propose a definition of the attention sampler. Our approach significantly reduces the computational burden of traditional attention mechanisms. We analyze the attention sampler from a theoretical perspective, in terms of its space and update time. Additionally, our framework is scalable and broadly applicable across model architectures and domains.
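
To make the tensor-product formulation and the classical $\ell_2$ sampling distribution concrete, the following is a minimal offline sketch in plain Python. It is not the streaming algorithm of the paper: it assumes small dense matrices held in memory, a row-major $\operatorname{vec}$, and the standard $\ell_2$ distribution in which coordinate $i$ of $y = (A_1 \otimes A_2)\operatorname{vec}(X)$ is drawn with probability $y_i^2 / \|y\|_2^2$. All function names and the example matrices are hypothetical and chosen only for illustration.

```python
import random

def kron(A, B):
    """Kronecker product of two dense matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def matmul(A, B):
    """Dense matrix product of A and B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def vec(X):
    """Row-major vectorization (rows concatenated); an assumed convention."""
    return [v for row in X for v in row]

def l2_distribution(y):
    """Classical l2 sampling distribution: p_i = y_i^2 / ||y||_2^2."""
    total = sum(v * v for v in y)
    return [v * v / total for v in y]

def l2_sample(y, rng):
    """Draw one coordinate index of y from the l2 distribution (exact, offline)."""
    return rng.choices(range(len(y)), weights=l2_distribution(y), k=1)[0]

# Hypothetical small instance: A1 is 2x2, A2 is 3x2, X is 2x2.
A1 = [[1, 2], [0, 1]]
A2 = [[1, 0], [1, 1], [2, 1]]
X = [[1, -1], [2, 0]]

# Explicit route: form A = A1 (x) A2 and multiply by x = vec(X).
x = vec(X)
y_kron = [sum(a * v for a, v in zip(row, x)) for row in kron(A1, A2)]

# Factored route: with row-major vec, (A1 (x) A2) vec(X) = vec(A1 X A2^T),
# so the Kronecker product never has to be materialized.
y_fact = vec(matmul(matmul(A1, X), transpose(A2)))

assert y_kron == y_fact

probs = l2_distribution(y_fact)
index = l2_sample(y_fact, random.Random(0))
```

The factored route is the reason the tensor structure matters for efficiency: the product can be evaluated from $A_1$, $A_2$, and $X$ alone. The sampler here computes every coordinate exactly, whereas a streaming sampler has to approximate the same distribution in small space as updates arrive.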
