Efficient Linear Attention for Multivariate Time Series Modeling via Entropy Equality
Mingtao Zhang, Guoli Yang, Zhanxing Zhu, Mengzhu Wang, Xiaoying Bai
TL;DR
The quadratic cost of attention is a critical bottleneck for multivariate time-series modeling. The authors introduce Entropy-Aware Linear Attention (EALA) and its practical instantiation, ELinFormer, grounded in an entropy-equality principle and the strict concavity of the entropy $H$ on the probability simplex. They derive a linear-time approximation of the entropy of dot-product attention distributions and an analytic solution for a tunable parameter $\theta_q^*$, yielding a linear-complexity attention that can replace softmax-based attention in existing architectures. Experiments on four real-world spatio-temporal datasets show competitive forecasting accuracy with substantial reductions in memory and computation, suggesting that a balanced weight distribution, rather than the softmax nonlinearity alone, drives attention effectiveness in these settings. The work offers a scalable, graph-agnostic attention substitute with potential applicability to long-range time-series and NLP tasks.
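For reference, the concavity property the summary relies on is the standard one for Shannon entropy. The notation below ($p$, $q$, $\lambda$, $n$) is chosen here for illustration and is not necessarily the paper's:

$$
H(p) = -\sum_{i=1}^{n} p_i \log p_i, \qquad
H\big(\lambda p + (1-\lambda) q\big) \;\ge\; \lambda H(p) + (1-\lambda) H(q), \quad \lambda \in [0, 1],
$$

where $p$ and $q$ are probability vectors on the simplex. For $\lambda \in (0, 1)$ equality holds only when $p = q$, which is what strict concavity means. $H$ ranges from $0$ (all mass on one entry) to $\log n$ (uniform weights), so it acts as a scalar measure of how evenly an attention row spreads its weight.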
Abstract
Attention mechanisms have been extensively employed in various applications, including time series modeling, owing to their capacity to capture intricate dependencies; however, their utility is often constrained by quadratic computational complexity, which impedes scalability for long sequences. In this work, we propose a novel linear attention mechanism designed to overcome these limitations. Our approach is grounded in a theoretical demonstration that entropy, as a strictly concave function on the probability simplex, implies that distributions with aligned probability rankings and similar entropy values exhibit structural resemblance. Building on this insight, we develop an efficient approximation algorithm that computes the entropy of dot-product-derived distributions with only linear complexity, enabling the implementation of a linear attention mechanism based on entropy equality. Through rigorous analysis, we reveal that the effectiveness of attention in spatio-temporal time series modeling may not primarily stem from the non-linearity of softmax but rather from the attainment of a moderate and well-balanced weight distribution. Extensive experiments on four spatio-temporal datasets validate our method, demonstrating competitive or superior forecasting performance while achieving substantial reductions in both memory usage and computational time.
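The contrast the abstract draws between quadratic softmax attention and a linear-complexity substitute can be illustrated with a minimal NumPy sketch. This is not the authors' EALA or ELinFormer algorithm: it uses a generic kernelized linear attention, and `feature_map` (an `elu(x) + 1` map), the function names, and the random inputs are all assumptions made for illustration. It only shows how reordering the matrix products avoids building the n x n weight matrix, and how the row entropy of the resulting weights can be compared with that of softmax.

```python
import numpy as np


def softmax_attention(Q, K, V):
    """Standard attention: builds the full n x n weight matrix (quadratic in n)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    W = np.exp(scores)
    W = W / W.sum(axis=-1, keepdims=True)
    return W @ V, W


def row_entropy(W, eps=1e-12):
    """Shannon entropy of each row of a row-stochastic matrix."""
    return -(W * np.log(W + eps)).sum(axis=-1)


def feature_map(X):
    """Hypothetical positive feature map, elu(x) + 1 (an assumption, not from the paper)."""
    return np.where(X > 0, X + 1.0, np.exp(np.minimum(X, 0.0)))


def linear_attention(Q, K, V):
    """Generic kernelized attention computed without forming the n x n matrix."""
    Qf, Kf = feature_map(Q), feature_map(K)
    KV = Kf.T @ V           # shape (d, d_v): summarizes keys and values once
    Z = Kf.sum(axis=0)      # shape (d,): per-feature normalizer
    return (Qf @ KV) / (Qf @ Z)[:, None]


def linear_attention_weights(Q, K):
    """Explicit n x n weights implied by linear_attention (for checking only)."""
    Qf, Kf = feature_map(Q), feature_map(K)
    W = Qf @ Kf.T
    return W / W.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
n, d = 64, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

out_soft, W_soft = softmax_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
W_lin = linear_attention_weights(Q, K)

# The linear-time output matches the explicit weighted sum of values.
assert np.allclose(out_lin, W_lin @ V)

# Mean row entropy of each weight distribution, next to the maximum log(n).
print(row_entropy(W_soft).mean(), row_entropy(W_lin).mean(), np.log(n))
```

The linear variant costs on the order of n·d·d_v operations instead of n²·d, because the keys and values are reduced to a d x d_v summary before the queries are applied. The printed entropies give a rough sense of how balanced each weight distribution is, which is the quantity the abstract argues matters more than the softmax nonlinearity itself.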
