Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models

Benjamin Walker, Lingyi Yang, Nicola Muca Cirone, Cristopher Salvi, Terry Lyons

TL;DR

The paper proves that, unlike the diagonal state-transition matrices of S4D and Mamba, SLiCEs employing block-diagonal, sparse, or Walsh-Hadamard matrices match the maximal expressivity of dense matrices.

Abstract

This work introduces Structured Linear Controlled Differential Equations (SLiCEs), a unifying framework for sequence models with structured, input-dependent state-transition matrices that retain the maximal expressivity of dense matrices whilst being cheaper to compute. The framework encompasses existing architectures, such as input-dependent block-diagonal linear recurrent neural networks and DeltaNet's diagonal-plus-low-rank structure, as well as two novel variants based on sparsity and the Walsh-Hadamard transform. We prove that, unlike the diagonal state-transition matrices of S4D and Mamba, SLiCEs employing block-diagonal, sparse, or Walsh-Hadamard matrices match the maximal expressivity of dense matrices. Empirically, SLiCEs solve the $A_5$ state-tracking benchmark with a single layer, achieve best-in-class length generalisation on regular language tasks among parallel-in-time models, and match the performance of log neural controlled differential equations on six multivariate time-series classification datasets while cutting the average time per training step by a factor of twenty.
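To make the idea of a structured, input-dependent state-transition matrix concrete, here is a minimal NumPy sketch of a block-diagonal linear CDE recurrence. It is an illustration under stated assumptions, not the authors' implementation: it assumes a plain Euler-style update $h \leftarrow h + \sum_i A^i h \, \Delta x_i$ with one block-diagonal matrix $A^i$ per input channel and omits any bias or time channel. The names `block_diagonal_step`, `run_block_diagonal_slice`, and `blocks` are hypothetical.

```python
import numpy as np


def block_diagonal_step(h, dx, blocks):
    """One assumed Euler-style step: h <- h + sum_i (A^i h) * dx_i.

    h:      (num_blocks, b)          hidden state, split into blocks
    dx:     (d_x,)                   increment of the input path
    blocks: (d_x, num_blocks, b, b)  blocks of one block-diagonal matrix
                                     per input channel (hypothetical layout)
    """
    # Input-dependent transition: contract the channel axis with the increment.
    A = np.einsum("i,ikab->kab", dx, blocks)  # (num_blocks, b, b)
    # Each block acts only on its own slice of the hidden state.
    return h + np.einsum("kab,kb->ka", A, h)


def run_block_diagonal_slice(xs, h0, blocks):
    """Apply the step sequentially over the increments of the path xs (T, d_x)."""
    h = h0
    for dx in np.diff(xs, axis=0):
        h = block_diagonal_step(h, dx, blocks)
    return h
```

Storing only the blocks costs `num_blocks * b * b` entries per input channel rather than `(num_blocks * b) ** 2` for a dense matrix. The loop above is sequential for readability; because each step is a linear map of `h`, consecutive steps compose by matrix multiplication, which is the property a parallel-in-time scan relies on.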

Paper Structure

This paper contains 36 sections, 6 theorems, 70 equations, 5 figures, 7 tables, 1 algorithm.

Key Result

Theorem 4.1

If $\max_jb_j\rightarrow\infty$ as $d_h\rightarrow\infty$, then block-diagonal SLiCEs have maximal probabilistic expressivity.
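As a small worked illustration of the hypothesis, assume that $b_j$ denotes the size of the $j$-th diagonal block and that $d_h$ is the hidden dimension, so that $\sum_j b_j = d_h$ (this reading of the notation is an assumption, since the summary does not define the symbols). Two block-size schedules then behave as follows:

```latex
% Assumption: b_j is the size of the j-th block and \sum_j b_j = d_h.
\begin{align*}
b_j = \sqrt{d_h} \ \text{for all } j
  &\;\Longrightarrow\; \max_j b_j = \sqrt{d_h} \to \infty
     \ \text{as } d_h \to \infty
  && \text{(hypothesis satisfied)},\\
b_j = 4 \ \text{for all } j
  &\;\Longrightarrow\; \max_j b_j = 4 \ \text{for every } d_h
  && \text{(hypothesis not satisfied)}.
\end{align*}
% Entries stored per input channel in the first case:
% (d_h / \sqrt{d_h}) \cdot (\sqrt{d_h})^2 = d_h^{3/2}, versus d_h^2 for a dense matrix.
```

Under this reading, the block sizes only need to grow with $d_h$; they do not need to grow linearly. The theorem states a sufficient condition, so it says nothing by itself about schedules with bounded block sizes.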

Figures (5)

  • Figure 1: Results for the $A_5$ benchmark and the $A_5$ length generalisation task. The models evaluated are Mamba, LSTM, mLSTM, sLSTM, Gated DeltaProduct with negative eigenvalues, and the SLiCEs.
  • Figure 2: Graphical representations of the matrix $\boldsymbol{W}\,\text{diag}(\boldsymbol{\Delta})$. Here edges correspond to identity matrices.
  • Figure 3: The product $\frac{1}{N} \left\langle A_I h_0, A_J h_0\right\rangle_{\mathbb{R}^N}$ as a product graph $G$.
  • Figure 4: Construction of $\boldsymbol{(G_{I,J})_\phi}$. Here, we display all the pairings, represented by the red dashed lines, for $I = J = 11$ along with their intermediate stages. For simplicity, we omit the $H$ labels from edges with arrows.
  • Figure 5: Average per-step training time versus average validation accuracy across six multivariate time-series classification datasets from the UEA-MTSCA. Each point represents a model, with circle area proportional to average GPU memory usage. We compare four families of models: a recurrent neural network (LRU), SSMs (S5, S6, and Mamba), non-linear NCDEs (NCDE, NRDE, and Log-NCDE), and linear NCDEs (Diagonal SLiCE, Block-Diagonal SLiCE, and Dense LNCDE). All test accuracy results, except those for the linear NCDEs, are from Walker2024LogNCDE. All timing and GPU memory measurements were re-run on an NVIDIA H100 GPU.

Theorems & Definitions (14)

  • Definition 3.1: Maximal expressivity
  • Definition 3.2: Maximal Probabilistic Expressivity
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Definition B.1
  • Proposition B.2
  • proof
  • Remark B.3
  • Proposition B.4
  • ...and 4 more