Efficient Linear Attention for Multivariate Time Series Modeling via Entropy Equality

Mingtao Zhang, Guoli Yang, Zhanxing Zhu, Mengzhu Wang, Xiaoying Bai

TL;DR

Attention scalability is a critical bottleneck for multivariate time-series modeling. The authors introduce Entropy-Aware Linear Attention (EALA) and its practical instantiation ELinFormer, grounded in the entropy-equality principle and the concavity of entropy $H$ on the probability simplex. They derive a linear-time entropy approximation for dot-product attention and provide an analytic solution for a tunable parameter $\theta_q^*$, enabling linear-complexity attention that can replace traditional softmax-based attention in existing architectures. Empirical results on real-world spatio-temporal datasets show competitive forecasting accuracy with substantial reductions in memory and computation, suggesting that balanced weight distributions—not nonlinear softmax alone—drive attention effectiveness in these settings. The work offers a scalable, graph-agnostic attention substitute with potential applicability to long-range time-series and NLP tasks.

Abstract

Attention mechanisms have been extensively employed in various applications, including time series modeling, owing to their capacity to capture intricate dependencies; however, their utility is often constrained by quadratic computational complexity, which impedes scalability for long sequences. In this work, we propose a novel linear attention mechanism designed to overcome these limitations. Our approach is grounded in a theoretical demonstration that entropy, as a strictly concave function on the probability simplex, implies that distributions with aligned probability rankings and similar entropy values exhibit structural resemblance. Building on this insight, we develop an efficient approximation algorithm that computes the entropy of dot-product-derived distributions with only linear complexity, enabling the implementation of a linear attention mechanism based on entropy equality. Through rigorous analysis, we reveal that the effectiveness of attention in spatio-temporal time series modeling may not primarily stem from the non-linearity of softmax but rather from the attainment of a moderate and well-balanced weight distribution. Extensive experiments on four spatio-temporal datasets validate our method, demonstrating competitive or superior forecasting performance while achieving substantial reductions in both memory usage and computational time.
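To make the complexity argument concrete, here is a minimal sketch in Python of why a kernelized attention can run in linear time. It is not the authors' entropy-equality algorithm, whose details are not given in this summary; it swaps in the generic feature-map form of linear attention and assumes an `elu(x) + 1` feature map. The function names (`feature_map`, `quadratic_attention`, `linear_attention`, `row_entropy`) and the sizes `n` and `d` are illustrative. The sketch also reports the entropy of each attention row, the quantity the abstract ties to a well-balanced weight distribution.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a strictly positive feature map (an assumption of this sketch,
    # not taken from the paper).
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_attention(Q, K, V):
    # Forms the full n x n weight matrix: O(n^2 d) time and O(n^2) memory.
    scores = feature_map(Q) @ feature_map(K).T
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ V, weights

def linear_attention(Q, K, V):
    # Reorders the product as phi(Q) (phi(K)^T V), so the n x n matrix is never
    # built: O(n d^2) time and O(d^2) extra memory.
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V            # shape (d, d_v)
    z = phi_k.sum(axis=0)       # shape (d,), used for row normalization
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def row_entropy(weights):
    # Shannon entropy (natural log) of each row of a strictly positive
    # row-stochastic matrix.
    return -(weights * np.log(weights)).sum(axis=1)

rng = np.random.default_rng(0)
n, d = 64, 8                    # illustrative sequence length and head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

out_quad, weights = quadratic_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
entropies = row_entropy(weights)
```

Both functions return the same output up to floating-point error; only the order of the matrix products differs. The row entropies are bounded above by `log(n)`, the value reached by a uniform weight row.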

Paper Structure

This paper contains 12 sections, 1 theorem, 8 equations, 3 tables, 1 algorithm.

Key Result

Proposition 1

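Let $\Delta^n$ denote the $n$-dimensional probability simplex defined as:

$$\Delta^n = \left\{ \mathbf{p} = (p_1, \dots, p_n) \in \mathbb{R}^n : p_i \ge 0,\ \sum_{i=1}^{n} p_i = 1 \right\}.$$

The entropy function $H: \Delta^n \to \mathbb{R}$ is given by:

$$H(\mathbf{p}) = -\sum_{i=1}^{n} p_i \log p_i,$$

where $\log$ denotes the natural logarithm. And the KL-divergence is defined as:

$$D_{\mathrm{KL}}(\mathbf{p} \,\|\, \mathbf{q}) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}.$$

Then the following properties hold:

Implication under Consistent Ordering: If two probability distributions $\mathbf{p}, \mathbf{q} \in \Delta^n$ have the same ordering (i.e., for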
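The quantities named in Proposition 1 can be checked numerically. The following is a minimal sketch, assuming the standard Shannon entropy and KL-divergence with the natural logarithm; the helper names (`entropy`, `kl_divergence`) and the example vectors `p` and `q` are illustrative and not taken from the paper. It checks the midpoint form of concavity for one pair of distributions that share the same ordering, and the identity linking entropy to the KL-divergence from the uniform distribution.

```python
import numpy as np

def entropy(p):
    # Shannon entropy with natural log; zero entries contribute 0 by convention.
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    # KL(p || q); assumes q > 0 wherever p > 0.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

# Two illustrative distributions with the same ordering of entries.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.35, 0.25])
uniform = np.full(3, 1.0 / 3.0)

# Strict concavity at the midpoint: H((p + q) / 2) > (H(p) + H(q)) / 2 for p != q.
midpoint = 0.5 * (p + q)
concavity_gap = entropy(midpoint) - 0.5 * (entropy(p) + entropy(q))

# Identity: H(p) = log(n) - KL(p || uniform).
identity_gap = entropy(p) - (np.log(3) - kl_divergence(p, uniform))
```

A positive `concavity_gap` is what strict concavity predicts for this pair, and `identity_gap` should vanish up to rounding. A single pair is an illustration, not a proof.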

Theorems & Definitions (1)

  • Proposition 1