Neural SDEs as a Unified Approach to Continuous-Domain Sequence Modeling

Macheng Shen, Chen Cheng

TL;DR

This work presents Neural SDEs as a unified framework for continuous-time sequence modeling, treating sequences as trajectories of an underlying stochastic dynamical system and learning both drift and diffusion via maximum likelihood. By employing a time-invariant, diagonally diffusive SDE and a simulation-free training objective, the approach directly models state-to-state transitions without unrolling from a noise prior. Empirical results demonstrate accurate multi-modal trajectory generation, robustness to sharp dynamics, and efficient inference with few steps in tasks ranging from imitation learning to video prediction, with added capability for temporal interpolation. The method offers a principled bridge between continuous-time dynamics and data-driven sequence modeling, enabling scalable, interpretable modeling of complex temporal processes in embodied and generative AI settings.

Abstract

Inspired by the ubiquitous use of differential equations to model continuous dynamics across diverse scientific and engineering domains, we propose a novel and intuitive approach to continuous sequence modeling. Our method interprets time-series data as discrete samples from an underlying continuous dynamical system, and models its time evolution using a Neural Stochastic Differential Equation (Neural SDE), where both the flow (drift) and diffusion terms are parameterized by neural networks. We derive a principled maximum likelihood objective and a simulation-free scheme for efficient training of our Neural SDE model. We demonstrate the versatility of our approach through experiments on sequence modeling tasks across both embodied and generative AI. Notably, to the best of our knowledge, this is the first work to show that SDE-based continuous-time modeling also excels in such complex scenarios, and we hope that our work opens up new avenues for research on SDE models in high-dimensional and temporally intricate domains.
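
To make the maximum likelihood idea concrete, here is a minimal sketch of a one-step transition likelihood. It assumes an Euler-Maruyama discretization with diagonal diffusion, under which the next state given the current one is Gaussian with mean $x_t + f(x_t)\Delta t$ and per-dimension variance $g(x_t)^2 \Delta t$. The paper's exact objective is not reproduced in this summary, so `transition_nll` and the toy `drift` and `diffusion` functions are hypothetical stand-ins for the neural networks, not the authors' implementation.

```python
import math

def transition_nll(x_t, x_next, drift, diffusion, dt):
    """Gaussian negative log-likelihood of x_next given x_t for one
    Euler-Maruyama step with diagonal diffusion (illustrative assumption)."""
    f = drift(x_t)
    g = diffusion(x_t)
    nll = 0.0
    for xi, xn, fi, gi in zip(x_t, x_next, f, g):
        var = gi * gi * dt              # per-dimension transition variance
        resid = xn - (xi + fi * dt)     # deviation from the predicted mean
        nll += 0.5 * (math.log(2.0 * math.pi * var) + resid * resid / var)
    return nll

# Toy, time-invariant coefficients standing in for learned networks.
def drift(x):
    return [-v for v in x]

def diffusion(x):
    return [0.5 for _ in x]

dt = 0.1
x_t = [1.0, -2.0]
mean = [xi + fi * dt for xi, fi in zip(x_t, drift(x_t))]
shifted = [m + 0.3 for m in mean]

nll_at_mean = transition_nll(x_t, mean, drift, diffusion, dt)
nll_off_mean = transition_nll(x_t, shifted, drift, diffusion, dt)
```

Each term depends only on an observed pair of consecutive samples, so summing it over a dataset requires no SDE rollout during training. This is one plausible reading of "simulation-free", not a statement of the paper's scheme.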

Paper Structure

This paper contains 45 sections, 1 theorem, 40 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Theorem 2.1

For any scaling factor $\lambda > 0$, let: [equation missing]. Then the scaled SDE: [equation missing] has an Euler-Maruyama discretization that is statistically equivalent to that of the original SDE.
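
The following sketch illustrates the kind of equivalence the theorem describes. The scaled coefficients are not shown in this summary, so the scaling used here is an assumption: drift multiplied by $\lambda$, diffusion multiplied by $\sqrt{\lambda}$, and step size divided by $\lambda$. It was chosen because it makes the two Euler-Maruyama updates coincide term by term ($\lambda f \cdot \Delta t/\lambda = f \Delta t$ and $\sqrt{\lambda}\, g \cdot \sqrt{\Delta t/\lambda} = g \sqrt{\Delta t}$), and it is not necessarily the theorem's exact statement. All function names and toy coefficients are hypothetical.

```python
import math
import random

def euler_maruyama(drift, diffusion, x0, dt, noise):
    """Simulate a scalar SDE with the Euler-Maruyama scheme,
    using a pre-drawn sequence of standard normal increments."""
    x = x0
    path = [x]
    for eps in noise:
        x = x + drift(x) * dt + diffusion(x) * math.sqrt(dt) * eps
        path.append(x)
    return path

# Toy time-invariant coefficients (stand-ins for learned networks).
def drift(x):
    return -x

def diffusion(x):
    return 0.5 + 0.1 * abs(x)

lam = 4.0  # scaling factor, lambda > 0

def scaled_drift(x):
    return lam * drift(x)

def scaled_diffusion(x):
    return math.sqrt(lam) * diffusion(x)

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(50)]
dt = 0.1

original = euler_maruyama(drift, diffusion, 1.0, dt, noise)
rescaled = euler_maruyama(scaled_drift, scaled_diffusion, 1.0, dt / lam, noise)

# Largest pointwise difference between the two discrete paths.
max_gap = max(abs(a - b) for a, b in zip(original, rescaled))
```

With the same noise draws, the two discretizations produce the same path up to floating-point rounding, so they induce the same distribution over discrete trajectories.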

Figures (7)

  • Figure 1: Our approach introduces a new paradigm for continuous-domain sequence modeling by representing dynamics with SDEs, instead of directly modeling conditional densities. The Fokker-Planck equation provides the theoretical link between these two paradigms, describing the time evolution of the probability density. This framework unifies embodied and generative AI under the same continuous sequence modeling paradigm.
  • Figure 2: Trajectory generation on a Y-shape Bifurcation (multi-modal distribution). We compare our proposed Neural SDE approach with DDIM and Rectified Flow at two different densities (number of steps per trajectory). At a lower density, all models successfully generate bi-modal trajectories. At a higher density, DDIM and Rectified Flow fail due to covariate shift, while the Neural SDE still accurately captures both branches.
  • Figure 3: Ablation Study of the Neural SDE Components on the Y-shape Bifurcation Task (high density). We visualize the learned vector fields with different combinations of the Flow, Diffusion, and Denoiser terms. The vector fields are rescaled for visual clarity. The Flow term alone captures the general direction but lacks stochasticity. Adding Diffusion introduces stochasticity but fails to reach the bifurcation point accurately due to covariate shift. The Denoiser effectively mitigates covariate shift. As a result, the full model (Flow+Denoiser+Diffusion) accurately models the multi-modal distribution.
  • Figure 4: Non-Smooth Trajectory Generation. A Push-T trajectory generated by our Neural SDE, showcasing its ability to handle drastic changes in direction.
  • Figure 5: Inference Efficiency. The plots show the performance of Neural SDE, Flow Matching, and PFI on the KTH and CLEVRER datasets, measured by FVD, JEDi, SSIM, and PSNR as a function of the number of function evaluations (NFE). Lower FVD and JEDi and higher SSIM and PSNR indicate better performance. To control the NFEs of PFI and Neural SDE, we use fixed step sizes. All metrics are computed on 1024 generated test video sequences (randomly sampled with replacement).
  • ...and 2 more figures

Theorems & Definitions (2)

  • Theorem 2.1: Numerical Simulation Temporal Scale Invariance
  • proof