Sparse Modular Activation for Efficient Sequence Modeling

Liliang Ren; Yang Liu; Shuohang Wang; Yichong Xu; Chenguang Zhu; ChengXiang Zhai

TL;DR

This work introduces Sparse Modular Activation (SMA), a differentiable mechanism for dynamically and sparsely activating sub-modules within a neural sequence model, enabling significant efficiency gains without sacrificing expressiveness. The authors instantiate SeqBoat, a layered architecture that combines an efficient State Space Model with a Gated Attention Unit activated by SMA on compressed inputs, augmented by a working memory mechanism that enforces local attention to achieve linear-time inference. Across long-range, speech, and language tasks, SeqBoat achieves state-of-the-art results among linear-complexity hybrids and provides interpretable activation patterns that reveal task-specific attention requirements. The approach offers a practical path toward efficient, scalable sequence modeling and opens avenues for interpretability and integration with larger pre-trained models.

Abstract

Recent hybrid models combining Linear State Space Models (SSMs) with self-attention mechanisms have demonstrated impressive results across a range of sequence modeling tasks. However, current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs. To address this limitation, we introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely and dynamically activate sub-modules for sequence elements in a differentiable manner. Through allowing each element to skip non-activated sub-modules, SMA reduces computation and memory consumption of neural networks at both training and inference stages. To validate the effectiveness of SMA on sequence modeling, we design a novel neural architecture, SeqBoat, which employs SMA to sparsely activate a Gated Attention Unit (GAU) based on the state representations learned from an SSM. By constraining the GAU to only conduct local attention on the activated inputs, SeqBoat can achieve linear inference complexity with theoretically infinite attention span, and provide substantially better quality-efficiency trade-off than the chunking-based models. With experiments on a wide range of tasks, including long sequence modeling, speech classification and language modeling, SeqBoat brings new state-of-the-art results among hybrid models with linear complexity, and reveals the amount of attention needed for each task through the learned sparse activation patterns. Our code is publicly available at https://github.com/renll/SeqBoat.
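The claim that local attention on the activated inputs gives linear cost with an unbounded attention span can be illustrated with a small sketch. It assumes a causal sliding window over the last `memory_size` activated elements; the function name `working_memory_spans` and its parameters are hypothetical and are not taken from the SeqBoat code.

```python
from typing import List, Sequence


def working_memory_spans(activated: Sequence[int], memory_size: int) -> List[List[int]]:
    """For each activated time step, list the original positions it may attend to.

    Assumption (not from the source): a causal sliding window that keeps the
    last `memory_size` activated elements, including the current one.
    """
    # Original positions of the elements that were activated (compressed sequence).
    positions = [t for t, on in enumerate(activated) if on == 1]
    spans = []
    for k in range(len(positions)):
        start = max(0, k - memory_size + 1)
        # At most `memory_size` entries per element, so total work is linear.
        spans.append(positions[start:k + 1])
    return spans


spans = working_memory_spans([1, 0, 0, 0, 0, 0, 1, 0, 1], memory_size=2)
print(spans)  # [[0], [0, 6], [6, 8]]
```

In this toy run the element at position 6 attends back to position 0, six steps away, even though the window holds only two elements. The window is defined over the compressed sequence, so the distance covered in the original sequence depends on how sparsely elements are activated.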

Paper Structure (43 sections, 2 theorems, 35 equations, 6 figures, 8 tables)

Introduction
Background
Time-Invariant Sequence Modeling
Learning Sparse Modular Activation
Time-Variant Sequence Modeling
Sparse Modular Activation
Model Architecture
Working Memory Mechanism
Experiments and Results
Baseline Models
Long Sequence Modeling
Speech Classification
Language Modeling
Analysis
How much attention is needed for different sequence modeling tasks?
...and 28 more sections

Key Result

Theorem 1

For any $\mathcal{L}' \subseteq \mathcal{L} = \text{span}\{ f_1^l, \ldots, f_M^l \}$, there exists a pair $(\mathbf{a}'_t, \mathbf{c}'_t)$ such that $\mathcal{L}_{\text{SMA}}(\mathbf{a}'_t, \mathbf{c}'_t) = \mathcal{L}'$. In other words, SMA has full coverage of the function space $\mathcal{L}$.

Figures (6)

Figure 1: The proposed SeqBoat layer. Black lines indicate paths through which gradients are back-propagated, and red lines indicate gradient blocking. $\odot$ denotes element-wise multiplication and $\oplus$ denotes point-wise addition. The max, argmax and softmax operators are all applied to the projected dimension after the linear layer in the latent configurator block. The sparse activation operators are instantiated as compress and extract operators, respectively, for parallel processing.
Figure 2: The proposed parallel implementation of the Sparse Modular Activation (SMA) mechanism. In this example, we have two modules $f_1^l,f_2^l$ and an input sequence $\mathbf{H} =\{\mathbf{h}_1, \ldots, \mathbf{h}_4\}$. We assume only $a_2^1, a_4^1, a_4^2$ have values equal to one. The white block means a zero vector. The compressed sequences for modules $f_1^l$ and $f_2^l$ are $\mathbf{H}^c_1 = \{\mathbf{h}_2, \mathbf{h}_4\}$ and $\mathbf{H}^c_2 = \{\mathbf{h}_4\}$ respectively. The final outputs are aggregated as $\mathbf{y}_4 = c_4^1 \mathbf{y}_4^1 + c_4^2 \mathbf{y}_4^2$, and $\mathbf{y}_2 = c_2^1 \mathbf{y}_2^1.$
Figure 3: The activation time (with error bars) of the GAU module at different layers of the SeqBoat-full model for different tasks in the LRA benchmark.
Figure 4: The confidence probabilities of the GAU modular activation at each time step for the last two layers of the SeqBoat-full and SeqBoat models. The results are measured on three input sequences randomly sampled from the validation set of the Pathfinder task. The sequences are reshaped back to $32\times32$ squares for better visualization. The darker the blue, the higher the confidence. White blocks indicate time steps at which the GAU is not activated.
Figure 5: Training speed vs. validation accuracy trade-off on the Image and Pathfinder tasks of the LRA benchmark for different models with varying memory/chunk sizes. SeqBoat keeps a working memory of the compressed sequence, while MEGA-chunk splits the input sequence into non-overlapping sub-sequences. The memory/chunk sizes are marked along the lines. The GPU-hours for Image are measured on NVIDIA RTX A5000 GPUs, and those for Pathfinder on V100 GPUs with 32GB memory.
...and 1 more figure
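
The compress and extract steps described in the Figure 2 caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sma_forward`, the toy modules `f1` and `f2`, and all numeric values are hypothetical, and non-activated positions are simply left as zero vectors, as the caption describes.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# A module maps a compressed sequence of vectors to a sequence of equal length.
Module = Callable[[List[Vector]], List[Vector]]


def sma_forward(
    H: Sequence[Vector],
    activations: Sequence[Sequence[int]],    # activations[i][t] plays the role of a_t^i
    confidences: Sequence[Sequence[float]],  # confidences[i][t] plays the role of c_t^i
    modules: Sequence[Module],
) -> List[Vector]:
    """Compress, apply each module, extract, then sum with confidence weights."""
    dim = len(H[0])
    Y = [[0.0] * dim for _ in H]  # zero vectors for positions no module touches
    for a, c, f in zip(activations, confidences, modules):
        idx = [t for t, on in enumerate(a) if on == 1]
        if not idx:
            continue  # module skipped entirely
        compressed = [list(H[t]) for t in idx]  # compress: keep activated steps only
        out = f(compressed)                     # module runs on the short sequence
        for t, y in zip(idx, out):              # extract: scatter back to position t
            for d in range(dim):
                Y[t][d] += c[t] * y[d]
    return Y


# Toy setup mirroring the caption: module 1 is active at steps 2 and 4,
# module 2 at step 4 only (1-based steps; indices below are 0-based).
H = [[1.0], [2.0], [3.0], [4.0]]
f1: Module = lambda seq: [[10.0 * x for x in v] for v in seq]  # hypothetical module
f2: Module = lambda seq: [[-x for x in v] for v in seq]        # hypothetical module
activations = [[0, 1, 0, 1], [0, 0, 0, 1]]
confidences = [[0.0, 0.5, 0.0, 0.5], [0.0, 0.0, 0.0, 0.25]]

Y = sma_forward(H, activations, confidences, [f1, f2])
print(Y)  # [[0.0], [10.0], [0.0], [19.0]]
```

Here $\mathbf{y}_2$ receives only the weighted output of the first module, while $\mathbf{y}_4$ sums the weighted outputs of both. Each module processes only its compressed sequence, so its cost scales with the number of activated elements rather than with the full sequence length.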

Theorems & Definitions (2)

Theorem 1: Function Space Coverage of SMA
Theorem 2: Transfer Theorem for Latent Configurator
