Linear Diffusion Networks

Jacob Fein-Ashley

TL;DR

LDN tackles the bottleneck of sequential modeling by reframing temporal information sharing as a diffusion process. It integrates a PDE-inspired primary diffusion kernel $K$, a local update $F$, and a diffusion-based attention kernel $A_{\text{diff}}$, with an adaptive time step $\delta t$, to enable stable, parallelizable, multi-scale sequence processing. The approach provides global interactions with rigorous row-sum-zero constraints, yielding robust training and strong empirical results on ImageNet and Long Range Arena, often with fewer parameters and FLOPs than competitive transformers. This diffusion-centric framework bridges efficient computation and expressive representation learning, offering a versatile path for sequential modeling in both vision and language domains.

Abstract

We present Linear Diffusion Networks (LDNs), a novel architecture that reinterprets sequential data processing as a unified diffusion process. Our model integrates adaptive diffusion modules with localized nonlinear updates and a diffusion-inspired attention mechanism. This design enables efficient global information propagation while preserving fine-grained temporal details. LDN overcomes the limitations of conventional recurrent and transformer models by allowing full parallelization across time steps and supporting robust multi-scale temporal representations. Experiments on benchmark sequence modeling tasks demonstrate that LDN delivers competitive performance across ImageNet and LRA tasks.

Paper Structure

This paper contains 57 sections, 1 theorem, 28 equations, 2 figures, 2 tables.

Key Result

Theorem 1

Assume: Then, there exists a positive integer $L$ (dependent on the structure of $K$) such that every entry of the effective diffusion operator satisfies: Equivalently, every output token $h_i^{(L)}$ depends on every input token $h_j^{(0)}$: with Thus, the diffusion process inherently captures global dependencies.
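The claim can be checked numerically on a toy case. The sketch below is illustrative only and rests on assumptions not stated in this summary: $K$ is taken to be the 1D discrete Laplacian (tridiagonal, row-sum-zero), the time step is a fixed $\delta t = 0.25$, and the effective diffusion operator after $L$ layers is taken to be $(I + \delta t\, K)^L$. The helper names are hypothetical. Under those assumptions it finds the smallest $L$ at which every entry of the operator is strictly positive, so that every output token depends on every input token.

```python
def laplacian_kernel(n):
    """Assumed kernel: 1D discrete Laplacian; every row sums to zero."""
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                K[i][j] = 1.0   # weight on each neighbour
                K[i][i] -= 1.0  # diagonal cancels the row sum
    return K


def matmul(A, B):
    """Plain dense matrix product of two lists of lists."""
    inner, cols = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(A))]


def step_operator(K, dt):
    """One explicit diffusion step, I + dt * K (an assumed update form)."""
    n = len(K)
    return [[(1.0 if i == j else 0.0) + dt * K[i][j] for j in range(n)]
            for i in range(n)]


def layers_until_dense(K, dt, max_layers=100):
    """Smallest L with every entry of (I + dt*K)^L strictly positive, else None."""
    M = step_operator(K, dt)
    P = M
    for L in range(1, max_layers + 1):
        if all(x > 0.0 for row in P for x in row):
            return L
        P = matmul(P, M)
    return None


if __name__ == "__main__":
    K = laplacian_kernel(6)
    print(layers_until_dense(K, dt=0.25))
```

For this nearest-neighbour kernel, information moves one position per layer, so the number of layers needed grows with the sequence length; a kernel with longer-range connections would reach full coupling in fewer layers, which is consistent with $L$ depending on the structure of $K$.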

Figures (2)

  • Figure 1: Matrix-Form Illustration. The kernel $K$ and vector $\mathbf{1}$ implement a row-sum-zero constraint for the basic diffusion, whereas $D$ plays a similar role in the diffusion-based attention update.
  • Figure 2: LDN Overview. Each layer combines three modules (Diffusion, Local Update, and Diffusion-Based Attention) in a parallelizable, stable manner. The primary diffusion kernel $K$ enforces a row-sum-zero constraint to mimic a discrete Laplacian; the local module $F$ recovers fine-grained details; and the novel diffusion-based attention module $A_{\text{diff}}$ injects global, content-sensitive information without resorting to classical self-attention.
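
To make the two captions concrete, here is a minimal sketch of how one such layer might be assembled. It is not the paper's implementation, and several pieces are assumptions: the additive update $h \leftarrow h + \delta t\,(K h + F(h) + (A - D) h)$, a fixed scalar $\delta t$ in place of the adaptive one, a token-wise `tanh` standing in for $F$, and Gaussian affinities for the attention matrix $A$. What it takes from the captions is the row-sum-zero constraint on $K$ and the use of $D$ (the diagonal of row sums of $A$) to give the attention term the same property. All function names are hypothetical.

```python
import math


def row_sum_zero(W):
    """Copy W and reset its diagonal to minus the off-diagonal row sum."""
    K = [row[:] for row in W]
    for i in range(len(K)):
        K[i][i] = 0.0
        K[i][i] = -sum(K[i])  # row i now sums to zero
    return K


def apply(M, h):
    """Multiply an (n x n) matrix by a sequence of n token vectors (n x d)."""
    n, d = len(h), len(h[0])
    return [[sum(M[i][k] * h[k][c] for k in range(n)) for c in range(d)]
            for i in range(len(M))]


def diffusion_attention(h):
    """Assumed form of A_diff: Gaussian affinities A minus row sums D."""
    n = len(h)
    A = [[0.0 if i == j else
          math.exp(-sum((a - b) ** 2 for a, b in zip(h[i], h[j])))
          for j in range(n)] for i in range(n)]
    return row_sum_zero(A)  # equals A - D with D = diag(row sums of A)


def local_update(h):
    """Token-wise nonlinearity standing in for the local module F."""
    return [[math.tanh(x) for x in row] for row in h]


def ldn_layer(h, K, dt):
    """One assumed layer: h + dt * (K h + F(h) + (A - D) h)."""
    kh = apply(K, h)                        # primary diffusion
    ah = apply(diffusion_attention(h), h)   # diffusion-based attention
    fh = local_update(h)                    # local update
    return [[h[i][c] + dt * (kh[i][c] + fh[i][c] + ah[i][c])
             for c in range(len(h[0]))] for i in range(len(h))]


if __name__ == "__main__":
    K = row_sum_zero([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
    h = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(ldn_layer(h, K, dt=0.1))
```

One consequence of the row-sum-zero constraint is easy to see here: if all tokens are identical, both diffusion terms vanish and only the local update changes the sequence. Every token's update also depends only on the previous layer's states, so all positions can be computed in parallel.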

Theorems & Definitions (2)

  • Theorem 1: Global Dependency Theorem
  • proof