H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
Zhenhai Zhu, Radu Soricut
TL;DR
This work tackles the quadratic attention bottleneck in Transformers by introducing H-Transformer-1D, a hierarchical attention mechanism inspired by Hierarchical Matrix (H-Matrix) and Multigrid methods. Its time and memory costs are linear in the sequence length $L$, scaling as $O(dL)$. By constructing a multi-scale token hierarchy and applying distance-dependent low-rank approximations, the approach preserves short-range interactions while efficiently approximating longer-range dependencies. Empirically, it outperforms alternative sub-quadratic attention proposals on the Long Range Arena benchmark by more than 6 points on average, and it sets a new state-of-the-art test perplexity on the One-Billion Word dataset with 5x fewer parameters than the previous best Transformer-based models. The results support the proposed inductive bias and point to broad potential for efficient long-range modeling in NLP and vision tasks, with avenues for extending the method to cross-attention and 2D data.
Abstract
We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix) developed by the numerical analysis community, and has linear run time and memory complexity. We perform extensive experiments to show that the inductive bias embodied by our hierarchical attention is effective in capturing the hierarchical structure in sequences typical of natural language and vision tasks. Our method is superior to alternative sub-quadratic proposals by more than 6 points on average on the Long Range Arena benchmark. It also sets a new SOTA test perplexity on the One-Billion Word dataset with 5x fewer model parameters than the previous best Transformer-based models.
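The linear-complexity claim can be made concrete with a toy counting model. The sketch below is not the paper's algorithm. It assumes a simplified hierarchy in which each level halves the number of tokens, and every block of `block` tokens is scored only against itself and its two immediate neighbor blocks. The function names, the default block size of 16, and the three-blocks-per-row pattern are all illustrative assumptions. Under them, the number of attention scores stays below `6 * block * seq_len`, whereas dense attention needs `seq_len ** 2`.

```python
def dense_cost(seq_len: int) -> int:
    """Number of attention scores in standard (dense) attention."""
    return seq_len * seq_len


def hierarchical_cost(seq_len: int, block: int = 16) -> int:
    """Count attention scores in a toy hierarchical partition.

    Assumptions (illustrative only, not the paper's exact scheme):
      * seq_len == block * 2**k for some integer k >= 0;
      * each coarser level has half as many tokens as the one below it;
      * at every level, each block is scored against itself and its
        left and right neighbor blocks (clipped at the sequence ends).
    """
    num_blocks = seq_len // block
    if seq_len % block != 0 or num_blocks < 1 or num_blocks & (num_blocks - 1):
        raise ValueError("seq_len must equal block * 2**k")

    total = 0
    tokens = seq_len
    while tokens >= block:
        num_blocks = tokens // block
        # One diagonal block per block-row, plus the neighbors on each side.
        blocks_scored = num_blocks + 2 * (num_blocks - 1)
        total += blocks_scored * block * block
        tokens //= 2  # move to the next, coarser level
    return total


if __name__ == "__main__":
    for seq_len in (1024, 2048, 4096):
        print(seq_len, dense_cost(seq_len), hierarchical_cost(seq_len))
```

In this model, doubling `seq_len` roughly doubles the hierarchical count, while the dense count quadruples.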
