PDE-Transformer: A Continuous Dynamical Systems Approach to Sequence Modeling
Yukun Zhang, Xueqing Zhou
TL;DR
This work reframes sequence modeling as the gradient flow of a variational energy, introducing a diffusion term missing in standard Transformers to impose local smoothing. It presents a learnable Adaptive PDE Diffusion Layer with linear-time complexity and a multi-scale diffusion strategy, establishing a principled alignment between local geometric smoothing and global self-attention. The theory maps diffusion to a dedicated PDE layer, reaction to FFN, and nonlocal coupling to attention, while stability and spectral analyses motivate placement after the embedding layer for optimal information retention. Empirically, the approach yields a 4.1 percentage point improvement on Long Range Arena over strong baselines, with further gains from multi-scale diffusion, demonstrating effective augmentation of long-range modeling by combining local PDE smoothing with global attention.
Abstract
We propose PDE-Transformer, a novel sequence modeling paradigm that casts the forward pass of a Transformer as the numerical discretization of a continuous reaction-diffusion system derived from a variational energy functional. In our framework, token embeddings evolve under a partial differential equation in which the nonlocal integral term models self-attention, the local reaction term models the feed-forward layers, the diffusion term encodes positional smoothing, and a stability control term corresponds to layer normalization. From this unifying perspective, we design an Adaptive PDE Diffusion Layer: an efficient, learnable finite-difference stencil that enforces local smoothness in feature space with linear time complexity and complements the global routing of self-attention. Through a systematic theoretical analysis built on four pillars (stability, diffusion geometry, multi-scale dynamics, and component coupling), we derive principled guidelines for integrating the PDE layer at seven candidate points in the Transformer. Empirically, on the Long Range Arena benchmark, placing the layer immediately after the embedding yields a 4.1 pp average accuracy gain over a strong baseline, and an adaptive multi-scale variant delivers further improvements. Our work thus offers a principled, lightweight mechanism for strengthening long-range dependency modeling by harmonizing continuous PDE smoothing with discrete self-attention.
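
To make the idea of a linear-time finite-difference diffusion step concrete, the following is a minimal sketch and not the authors' implementation. It assumes a three-point Laplacian stencil along the sequence axis, one explicit Euler step, replicated-edge (zero-flux) boundaries, and a per-channel coefficient `alpha` that stands in for the learnable parameters; the function name `pde_diffusion_step` and all of these choices are hypothetical.

```python
import numpy as np


def pde_diffusion_step(x, alpha):
    """One explicit Euler step of 1D diffusion along the sequence axis.

    x:     array of shape (seq_len, d_model), the token embeddings.
    alpha: array of shape (d_model,), per-channel diffusion coefficients.
           Hypothetical stand-in for learnable parameters; values in
           [0, 0.5] keep this explicit scheme stable.
    """
    # Replicate the first and last tokens so the boundary has zero flux.
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    # Three-point stencil: x[i-1] - 2*x[i] + x[i+1], computed in O(seq_len).
    laplacian = padded[:-2] - 2.0 * x + padded[2:]
    # alpha broadcasts over the sequence axis, one coefficient per channel.
    return x + alpha * laplacian


# Usage: smooth a random embedding sequence of 16 tokens with 4 channels.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
alpha = np.full(4, 0.25)
y = pde_diffusion_step(x, alpha)
```

Under these assumptions the step costs a constant amount of work per token, leaves a constant sequence unchanged, and preserves the per-channel mean while damping differences between neighboring tokens. A learnable version would treat `alpha` (or the stencil weights themselves) as trainable parameters, constrained to a stable range.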
