Higher-Order Transformers With Kronecker-Structured Attention

Soroush Omranpour, Guillaume Rabusseau, Reihaneh Rabbany

TL;DR

Higher-Order Transformer (HOT) introduces a Kronecker-structured attention mechanism to model multiway tensor data without flattening, reducing the quadratic bottleneck of standard self-attention. By decomposing the attention into mode-wise components via Kronecker products or sums, HOT preserves tensor structure and enables scalable modeling with a controllable rank parameter, while maintaining expressiveness relative to full high-order attention. The paper provides theoretical results on stable rank and a universality guarantee (any high-order attention can be approximated as a sum of Kronecker products with increasing rank), and it derives a per-layer complexity bound of $O\big(D \, (\sum_i N_i) (\prod_j N_j)\big)$. Empirically, HOT achieves competitive or state-of-the-art performance on multivariate time-series forecasting, 3D medical image classification, and multispectral image segmentation, with substantially reduced memory and FLOPs and interpretable mode-wise attention maps. These results suggest HOT as a practical, efficient framework for learning complex cross-dimensional dependencies in high-dimensional data across diverse domains.
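
As a rough illustration of the scale of the per-layer bound $O\big(D \, (\sum_i N_i) (\prod_j N_j)\big)$, consider a hypothetical 2D input with $N_1 = N_2 = 64$ (sizes chosen here for the arithmetic, not taken from the paper). Flattening gives $N_1 N_2 = 4096$ positions, and standard dot-product attention scales with the square of that count:

$$
\begin{aligned}
\text{flattened attention:} \quad & D \, (N_1 N_2)^2 = D \cdot 4096^2 = 16{,}777{,}216 \, D \\
\text{Kronecker-structured:} \quad & D \, (N_1 + N_2)(N_1 N_2) = D \cdot 128 \cdot 4096 = 524{,}288 \, D
\end{aligned}
$$

Under these assumptions the factorized form needs about $32\times$ fewer operations per layer, ignoring constant factors and the factorization rank.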

Abstract

Modern datasets are increasingly high-dimensional and multiway, often represented as tensor-valued data with multi-indexed variables. While Transformers excel in sequence modeling and high-dimensional tasks, their direct application to multiway data is computationally prohibitive due to the quadratic cost of dot-product attention and the need to flatten inputs, which disrupts tensor structure and cross-dimensional dependencies. We propose the Higher-Order Transformer (HOT), a novel factorized attention framework that represents multiway attention as sums of Kronecker products or sums of mode-wise attention matrices. HOT efficiently captures dense and sparse relationships across dimensions while preserving tensor structure. Theoretically, HOT retains the expressiveness of full high-order attention and allows complexity control via factorization rank. Experiments on 2D and 3D datasets show that HOT achieves competitive performance in multivariate time series forecasting and image classification, with significantly reduced computational and memory costs. Visualizations of mode-wise attention matrices further reveal interpretable high-order dependencies learned by HOT, demonstrating its versatility for complex multiway data across diverse domains. The implementation of our proposed method is publicly available at https://github.com/s-omranpour/HOT.
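
The core computational idea, applying a Kronecker product of mode-wise attention matrices without ever forming the full attention matrix, can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the mode-wise matrices `S1` and `S2` are random row-softmaxed stand-ins (how HOT actually computes its mode-wise scores is not shown here), only a single rank-1 Kronecker term is used, and the helper name `kron_attention_apply` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kron_attention_apply(S1, S2, V):
    """Apply (S1 kron S2) to values V of shape (N1, N2, D), one mode at a time.

    Hypothetical helper: the full (N1*N2) x (N1*N2) matrix is never built.
    """
    out = np.einsum("ia,ajd->ijd", S1, V)    # mix along mode 1
    out = np.einsum("jb,ibd->ijd", S2, out)  # mix along mode 2
    return out

rng = np.random.default_rng(0)
N1, N2, D = 4, 5, 3

# Stand-in mode-wise attention matrices (assumption: random scores).
S1 = softmax(rng.normal(size=(N1, N1)))
S2 = softmax(rng.normal(size=(N2, N2)))
V = rng.normal(size=(N1, N2, D))

# Factorized path: cost grows like D * (N1 + N2) * N1 * N2.
factored = kron_attention_apply(S1, S2, V)

# Dense reference path: cost grows like D * (N1 * N2) ** 2.
full = np.kron(S1, S2)
dense = (full @ V.reshape(N1 * N2, D)).reshape(N1, N2, D)

print(np.allclose(factored, dense))
print(np.allclose(full.sum(axis=1), 1.0))
```

The two paths agree because `np.kron` and row-major `reshape` index the flattened positions the same way. The second check shows that the Kronecker product of row-stochastic matrices is itself row-stochastic, so the factorized form still behaves like an attention matrix over the flattened positions.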

Paper Structure

This paper contains 52 sections, 7 theorems, 26 equations, 13 figures, 10 tables.

Key Result

Proposition 3.8

For any tensor $\mathcal{T} \in \mathbb R^{N_1 \times N_2 \times \dots \times N_k \times d}$ of order $k+1$ and any matrices $A_1 \in \mathbb R^{M_1\times N_1},\cdots, A_k \in \mathbb R^{M_k\times N_k}$, we have $(\mathcal{T} \times_1 A_1 \times_2 A_2 \times_3 \cdots \times_k A_k)_{(k+1)} = \mathcal
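
The right-hand side of the proposition is cut off in this excerpt, so the following is only a numerical check of the standard relationship between successive mode products and a single Kronecker-structured matrix, not a restatement of the paper's exact formula. It assumes row-major (C-order) flattening of the first $k$ modes, which yields an unfolding of shape $(N_1 \cdots N_k) \times d$, the transpose of a mode-$(k+1)$ matricization; the paper's own matricization convention may order indices differently. The helper `mode_product` and all sizes are hypothetical.

```python
import numpy as np
from functools import reduce

def mode_product(T, A, mode):
    """Mode product: contract the columns of A with axis `mode` of T."""
    out = np.tensordot(A, T, axes=(1, mode))  # contracted axis comes out first
    return np.moveaxis(out, 0, mode)          # move it back into place

rng = np.random.default_rng(0)
N, M, d = (2, 3, 4), (3, 2, 5), 6             # illustrative sizes, k = 3

T = rng.normal(size=N + (d,))                 # order-(k+1) tensor
As = [rng.normal(size=(m, n)) for m, n in zip(M, N)]

# Left-hand side: apply A_1, ..., A_k along modes 1..k in turn.
Y = T
for mode, A in enumerate(As):
    Y = mode_product(Y, A, mode)

# Kronecker form: one structured matrix acting on the unfolded tensor.
K = reduce(np.kron, As)                       # shape (M1*M2*M3, N1*N2*N3)
lhs = Y.reshape(-1, d)
rhs = K @ T.reshape(-1, d)

print(np.allclose(lhs, rhs))
```

The check passes for non-square factors as well, which matches the proposition's setting of $A_i \in \mathbb R^{M_i \times N_i}$.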

Figures (13)

  • Figure 1: Overall structure of the Higher-Order Transformer (HOT), depicting the proposed method for 2D data of size $N_1 \times N_2 \times D$. The model shares the same arrangement as the Transformer encoder while employing Kronecker Factorized Multihead Attention to reduce the computational complexity. Each mode of the tensor (e.g., $N_1, N_2$) has its own attention matrix, combined using Kronecker product operations.
  • Figure 2: Visualization of a rank-$R$ Kronecker decomposition of a high-order full attention matrix $S \in \mathbb{R}^{N_1 N_2 \cdots N_k \times N_1 N_2 \cdots N_k}$ with factor matrices $S_i \in \mathbb{R}^{N_i \times N_i}$. Note that the actual full attention matrix on the LHS can be potentially much larger than what is depicted in the figure.
  • Figure 3: Evolution of the average stable rank for mode-wise attention matrices across training steps for HOT (product) and HOT (sum) models.
  • Figure 4: Evolution of the average stable rank for full attention matrices across training steps. Results are shown for the Transformer (non-factorized), HOT (sum), and HOT (product) models.
  • Figure 5: Effect of increasing the number of heads (i.e., factorization rank) on model performance with Kronecker product, Kronecker sum, and non-factorized attention. Left: Multivariate time series datasets. Right: 3D medical imaging datasets.
  • ...and 8 more figures

Theorems & Definitions (20)

  • Definition 3.1: Tensor
  • Definition 3.2: Tensor Mode and Fibers
  • Definition 3.3: Tensor Slice
  • Definition 3.4: Tensor Matricization
  • Definition 3.5: Mode $n$ tensor product
  • Definition 3.6: Rank-1 Tensor
  • Definition 3.7: Kronecker Product and Sum
  • Proposition 3.8
  • Theorem 4.1: Row stochastic property
  • proof
  • ...and 10 more