PDE-Transformer: A Continuous Dynamical Systems Approach to Sequence Modeling

Yukun Zhang, Xueqing Zhou

TL;DR

This work reframes sequence modeling as the gradient flow of a variational energy, introducing a diffusion term missing in standard Transformers to impose local smoothing. It presents a learnable Adaptive PDE Diffusion Layer with linear-time complexity and a multi-scale diffusion strategy, establishing a principled alignment between local geometric smoothing and global self-attention. The theory maps diffusion to a dedicated PDE layer, reaction to FFN, and nonlocal coupling to attention, while stability and spectral analyses motivate placement after the embedding layer for optimal information retention. Empirically, the approach yields a 4.1 percentage point improvement on Long Range Arena over strong baselines, with further gains from multi-scale diffusion, demonstrating effective augmentation of long-range modeling by combining local PDE smoothing with global attention.

Abstract

We propose PDE-Transformer, a novel sequence modeling paradigm that casts the forward pass of a Transformer as the numerical discretization of a continuous reaction-diffusion system derived from a variational energy functional. In our framework, token embeddings evolve under a partial differential equation in which the nonlocal integral term models self-attention, the local reaction term models feed-forward layers, the diffusion term encodes positional smoothing, and a stability control term corresponds to layer normalization. From this unifying perspective, we design an Adaptive PDE Diffusion Layer: an efficient, learnable finite-difference stencil that enforces local smoothness in feature space with linear time complexity and complements self-attention's global routing. Through a systematic theoretical analysis based on four pillars (stability, diffusion geometry, multi-scale dynamics, and component coupling), we derive principled guidelines for integrating the PDE layer at seven candidate points in the Transformer. Empirically, on the Long Range Arena benchmark, placing the layer immediately after embedding yields a 4.1 pp average accuracy gain over a strong baseline, and an adaptive multi-scale variant delivers further improvements. Our work thus offers a principled, lightweight mechanism to bolster long-range dependency modeling by harmonizing continuous PDE smoothing with discrete self-attention.
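
To make the idea of a finite-difference diffusion stencil concrete, the following is a minimal sketch of one explicit Euler step of $u_t = \alpha\, u_{xx}$ applied along the token axis. It is an illustration under stated assumptions, not the authors' layer: the function name `diffusion_step`, the fixed scalar `alpha` (learnable in the paper), the unit grid spacing, and the replicated-boundary handling are all hypothetical choices made here.

```python
from typing import List


def diffusion_step(u: List[List[float]], alpha: float, dt: float = 1.0) -> List[List[float]]:
    """One explicit Euler step of u_t = alpha * u_xx along the sequence axis.

    u is a (seq_len x dim) list of token embeddings. Each token is updated
    with the 3-point stencil [1, -2, 1] over its immediate neighbors, so the
    cost is O(seq_len * dim). Boundary tokens reuse themselves as the missing
    neighbor (an assumed zero-flux boundary, not taken from the paper).
    """
    n = len(u)
    out = []
    for i in range(n):
        left = u[max(i - 1, 0)]       # replicated left boundary
        right = u[min(i + 1, n - 1)]  # replicated right boundary
        out.append([
            c + alpha * dt * (l - 2.0 * c + r)
            for l, c, r in zip(left, u[i], right)
        ])
    return out


if __name__ == "__main__":
    # A single spike in a 5-token, 1-dimensional sequence spreads to its neighbors.
    spike = [[0.0], [0.0], [1.0], [0.0], [0.0]]
    print(diffusion_step(spike, alpha=0.25))
```

Because each output token depends only on a fixed-width neighborhood, the work grows linearly with sequence length, in contrast to the quadratic cost of full self-attention.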

Paper Structure

This paper contains 45 sections, 9 theorems, 21 equations, 10 figures, 16 tables.

Key Result

Theorem 3.1

The gradient flow of $E[u]$ is

Figures (10)

  • Figure 1: Theory framework of PDE-Transformer. Two stacked panels show (a) the unified variational formulation and corresponding PDE, and (b) the architectural mapping that highlights the missing diffusion component in standard Transformers.
  • Figure 2: Frequency domain analysis of the multi-scale diffusion mechanism. (Left) The transfer function $H(\omega)$ for different diffusion scales. Single-scale diffusions (Fast, Medium, Slow) act as low-pass filters with different cutoff frequencies. (Right) The energy distribution across four frequency bands. The multi-scale approach (green, dashed line) achieves a more balanced energy distribution across the entire frequency spectrum compared to any single-scale method, enabling it to capture a richer set of signal components from both global trends (low frequency) and local details (high frequency).
  • Figure 3: Overall performance comparison on the LRA benchmark. Error bars show standard deviation across five runs ($n{=}5$).
  • Figure 4: Detailed results of the multi-scale PDE ablation study on the ListOps task, broken down by PDE position.
  • Figure 5: Seven PDE integration positions in Transformer architecture. (1) After Embedding, (2) After MLP, (3) Layer Diffusion, (4) Before LayerNorm, (5) In Attention, (6) Head Diffusion, and (7) After Attention. The performance analysis (right) shows that inserting the PDE diffusion layer after the embedding layer yields the largest improvement (+4.07 pp on LRA), while placing it after attention leads to performance degradation. Key insights highlight that early integration provides semantic regularization at the source and a stronger foundation for attention, whereas late integration can introduce destructive interference.
  • ...and 5 more figures
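
The low-pass behavior described in the Figure 2 caption can be illustrated with the standard transfer function of the continuous heat equation, $H(\omega) = e^{-\alpha \omega^2 t}$. This is a sketch under assumptions: the paper's exact $H(\omega)$ is not reproduced here, and the diffusion coefficients and the uniform mixing weights in `mixed_transfer` are hypothetical values chosen only for illustration.

```python
import math
from typing import Optional, Sequence


def heat_transfer(omega: float, alpha: float, t: float = 1.0) -> float:
    """Transfer function of u_t = alpha * u_xx after time t: exp(-alpha * omega^2 * t)."""
    return math.exp(-alpha * omega * omega * t)


def mixed_transfer(
    omega: float,
    alphas: Sequence[float],
    weights: Optional[Sequence[float]] = None,
    t: float = 1.0,
) -> float:
    """Weighted average of single-scale transfer functions.

    Uniform weights are an assumption made for this sketch; an adaptive
    variant would presumably learn them.
    """
    if weights is None:
        weights = [1.0 / len(alphas)] * len(alphas)
    return sum(w * heat_transfer(omega, a, t) for w, a in zip(weights, alphas))


if __name__ == "__main__":
    alphas = [0.1, 1.0, 10.0]  # hypothetical diffusion scales
    for omega in (0.0, 0.5, 1.0, 2.0):
        singles = [round(heat_transfer(omega, a), 4) for a in alphas]
        print(omega, singles, round(mixed_transfer(omega, alphas), 4))
```

Each single scale attenuates high frequencies at a rate set by its coefficient, while the mixture always lies between the weakest and strongest single-scale responses, which is one way to read the more balanced energy distribution the caption describes.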

Theorems & Definitions (14)

  • Theorem 3.1: Unified Dynamical Equation
  • Theorem 3.2: Spectrum
  • Corollary 3.3: CFL Condition
  • Theorem 3.4: Lyapunov Monotonicity
  • Theorem A.1: Energy Monotonicity and Global Stability
  • proof
  • Theorem A.2: Exponential Decay of the Gradient Norm
  • proof
  • Theorem A.3: Polynomial Decay of the Heat Kernel
  • proof
  • ...and 4 more
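
For readers unfamiliar with CFL-type conditions such as the one named in Corollary 3.3, the classical example is the explicit 3-point scheme for the 1D heat equation: each Fourier mode is multiplied per step by $g(\omega) = 1 - 4r\sin^2(\omega/2)$ with $r = \alpha\,\Delta t/\Delta x^2$, and the scheme is stable exactly when $r \le 1/2$. The sketch below checks this textbook bound numerically; it assumes this classical setting, and the paper's corollary may be stated in a different form.

```python
import math


def amplification(omega: float, r: float) -> float:
    """Per-step amplification factor of the explicit 3-point heat scheme.

    r = alpha * dt / dx**2 is the dimensionless diffusion number.
    """
    return 1.0 - 4.0 * r * math.sin(omega / 2.0) ** 2


def is_stable(r: float, samples: int = 256) -> bool:
    """True if |g(omega)| <= 1 on a uniform grid of frequencies in [0, pi]."""
    omegas = [math.pi * k / samples for k in range(samples + 1)]
    return all(abs(amplification(w, r)) <= 1.0 + 1e-12 for w in omegas)


if __name__ == "__main__":
    for r in (0.25, 0.5, 0.6):
        print(r, is_stable(r))
```

The worst case is the highest frequency $\omega = \pi$, where $g = 1 - 4r$; requiring $|1 - 4r| \le 1$ gives the bound $r \le 1/2$.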