Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention

Andrew Kiruluta, Priscilla Burity, Samantha Williams

TL;DR

The paper tackles the quadratic time and memory bottleneck of self-attention in transformers by introducing the Learnable Multi-Scale Wavelet Transformer (LMWT), which replaces attention with a learnable Haar wavelet module that processes sequences at multiple scales in time linear in the sequence length. By learning scale-specific coefficients, the model captures both local details and global context while reducing complexity from $O(T^2d)$ to $O(Td)$, where $T$ is the sequence length and $d$ the feature dimension. On WMT16 En-De, the LMWT reaches BLEU scores competitive with a standard transformer, trains with higher throughput, and yields Haar coefficient representations that can be inspected visually. The approach is a promising, scalable alternative for efficient sequence modeling, with potential extensions to other tasks and wavelet families.
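As a rough check on the linear-time claim, the cost of a multi-level Haar decomposition can be summed level by level. This is a sketch under an assumption not stated in this summary: each level $l$ does a constant amount of work $c$ per position and per feature, and halves the sequence length before the next level. The symbols $c$ and $L$ (the number of levels) are introduced here for illustration only.

$$
\sum_{l=1}^{L} c \, \frac{T}{2^{\,l-1}} \, d \;=\; c\,T\,d \sum_{l=0}^{L-1} 2^{-l} \;<\; 2\,c\,T\,d \;=\; O(Td)
$$

By contrast, dot-product self-attention forms a $T \times T$ score matrix with a length-$d$ inner product per entry, which gives the $O(T^2 d)$ term.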

Abstract

Transformer architectures, underpinned by the self-attention mechanism, have achieved state-of-the-art results across numerous natural language processing (NLP) tasks by effectively modeling long-range dependencies. However, the computational complexity of self-attention, scaling quadratically with input sequence length, presents significant challenges for processing very long sequences or operating under resource constraints. This paper introduces the Learnable Multi-Scale Wavelet Transformer (LMWT), a novel architecture that replaces the standard dot-product self-attention with a learnable multi-scale Haar wavelet transform module. Leveraging the intrinsic multi-resolution properties of wavelets, the LMWT efficiently captures both local details and global context. Crucially, the parameters of the wavelet transform, including scale-specific coefficients, are learned end-to-end during training, allowing the model to adapt its decomposition strategy to the data and task. We present the detailed mathematical formulation of the learnable Haar wavelet module and its integration into the transformer framework, supplemented by an architectural diagram. We conduct a comprehensive experimental evaluation on a standard machine translation benchmark (WMT16 En-De), comparing the LMWT against a baseline self-attention transformer using metrics like BLEU score, perplexity, and token accuracy. Furthermore, we analyze the computational complexity, highlighting the linear scaling of our approach, discuss its novelty in the context of related work, and explore the interpretability offered by visualizing the learned Haar coefficients. Our results indicate that the LMWT achieves competitive performance while offering substantial computational advantages, positioning it as a promising and novel alternative for efficient sequence modeling.
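The idea of a multi-scale Haar decomposition with learnable, scale-specific coefficients can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the per-level weight tensors `low_w` and `high_w`, and the choice to initialize them at the classical orthonormal Haar filter values are all assumptions made for illustration. In a trained model these weights would be parameters updated by gradient descent; here they are fixed arrays.

```python
import numpy as np

def haar_init(num_levels, d):
    """Hypothetical initializer: classical orthonormal Haar filters per level."""
    s = 1.0 / np.sqrt(2.0)
    low_w = np.full((num_levels, 2, d), s)    # averaging filter (s, s)
    high_w = np.full((num_levels, 2, d), s)   # differencing filter (s, -s)
    high_w[:, 1, :] = -s
    return low_w, high_w

def learnable_haar_decompose(x, low_w, high_w, num_levels):
    """Multi-level Haar-style decomposition along the sequence axis.

    x:      (T, d) array with T divisible by 2**num_levels
    low_w:  (num_levels, 2, d) per-scale weights for the approximation branch
    high_w: (num_levels, 2, d) per-scale weights for the detail branch
    Returns the coarsest approximation and a list of per-level detail arrays.
    """
    approx = x
    details = []
    for level in range(num_levels):
        even, odd = approx[0::2], approx[1::2]  # neighbouring pairs
        detail = high_w[level, 0] * even + high_w[level, 1] * odd
        approx = low_w[level, 0] * even + low_w[level, 1] * odd
        details.append(detail)                  # sequence length halves each level
    return approx, details

if __name__ == "__main__":
    T, d, levels = 8, 2, 3
    x = np.arange(T * d, dtype=float).reshape(T, d)
    low_w, high_w = haar_init(levels, d)
    approx, details = learnable_haar_decompose(x, low_w, high_w, levels)
    print("approximation shape:", approx.shape)
    print("detail shapes:", [c.shape for c in details])
```

Each level touches every remaining position once and then halves the length, so the total work is proportional to $Td$. The first levels capture differences between adjacent tokens (local detail), while the final approximation summarizes the whole sequence (global context).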

Paper Structure

This paper contains 24 sections, 13 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Overview of the proposed learnable multi-scale Haar wavelet transformer architecture (LMWT). The diagram depicts the flow from input tokens through embedding, positional encoding, the LMWT block (containing the multi-scale Haar module and FFN), and potentially subsequent layers culminating in the final output (e.g., decoder output or classification head).
  • Figure 2: Example heatmap visualization of learned Haar-like coefficients from a trained LMWT module. The horizontal axis represents token positions within the sequence, and the vertical axis denotes the feature dimension ($d=512$). Different rows or blocks could correspond to different scales ($l$). The structured patterns (bands, oscillations) suggest the model learns meaningful multi-resolution features.