Neural ODE and SDE Models for Adaptation and Planning in Model-Based Reinforcement Learning

Chao Han, Stefanos Ioannou, Luca Manneschi, T. J. Hayward, Michael Mangan, Aditya Gilra, Eleni Vasilaki

Abstract

We investigate neural ordinary and stochastic differential equations (neural ODEs and SDEs) to model stochastic dynamics in fully and partially observed environments within a model-based reinforcement learning (RL) framework. Through a sequence of simulations, we show that neural SDEs more effectively capture the inherent stochasticity of transition dynamics, enabling high-performing policies with improved sample efficiency in challenging scenarios. We leverage neural ODEs and SDEs for efficient policy adaptation to changes in environment dynamics via inverse models, requiring only limited interactions with the new environment. To address partial observability, we introduce a latent SDE model that combines an ODE with a GAN-trained stochastic component in latent space. Policies derived from this model provide a strong baseline, outperforming or matching general model-based and model-free approaches across stochastic continuous-control benchmarks. This work demonstrates the applicability of action-conditional latent SDEs for RL planning in environments with stochastic transitions. Our code is available at: https://github.com/ChaoHan-UoS/NeuralRL
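
As a rough illustration of the ODE/SDE distinction described above, the following minimal sketch rolls out an action-conditional SDE, ds = f(s, a) dt + g(s, a) dW, with the Euler-Maruyama scheme. The functions `toy_drift`, `toy_diffusion`, and `zero_diffusion` are hypothetical hand-written stand-ins for the learned drift and diffusion networks, and the step size and noise scale are arbitrary; none of this is taken from the authors' code. Setting the diffusion to zero recovers a deterministic ODE rollout, which cannot represent the spread of outcomes that the SDE rollouts produce.

```python
import numpy as np

def euler_maruyama(drift, diffusion, s0, actions, dt, rng):
    """Roll out ds = drift(s, a) dt + diffusion(s, a) dW with Euler-Maruyama."""
    s = np.asarray(s0, dtype=float)
    path = [s.copy()]
    for a in actions:
        # Brownian increment over one step: zero mean, variance dt.
        dw = rng.normal(0.0, np.sqrt(dt), size=s.shape)
        s = s + drift(s, a) * dt + diffusion(s, a) * dw
        path.append(s.copy())
    return np.stack(path)

# Hypothetical stand-ins for the learned networks (assumptions, not the paper's models).
def toy_drift(s, a):
    return -s + a

def toy_diffusion(s, a):
    return 0.3 * np.ones_like(s)

def zero_diffusion(s, a):
    return np.zeros_like(s)

rng = np.random.default_rng(0)
actions = np.zeros((50, 1))  # 50 steps of a zero action, for simplicity
dt = 0.02

# Zero diffusion: a single deterministic (ODE-like) trajectory.
ode_path = euler_maruyama(toy_drift, zero_diffusion, [1.0], actions, dt, rng)

# Non-zero diffusion: each rollout is one sample path from the SDE.
sde_paths = np.stack([
    euler_maruyama(toy_drift, toy_diffusion, [1.0], actions, dt, rng)
    for _ in range(200)
])
```

Here `sde_paths[:, -1, 0]` is an empirical sample of the final state, comparable in spirit to the predicted histograms in Figure 3(a), while `ode_path` yields a single value per timestep.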

Paper Structure (33 sections, 19 equations, 9 figures, 1 table, 1 algorithm)

This paper contains 33 sections, 19 equations, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: Computational graph of the latent SDE. (a) The RNN encoder and ODE decoder from the latent ODE. The encoder is used only during model training; at inference (generation) time, the initial latent $z_0$ is instead sampled from the standard Gaussian prior. During training, $\tilde{o}_{i}$ denotes either the real observation $o_i$ or the predicted one $\hat{o}_i$, depending on whether teacher forcing is used; during inference it is always $\tilde{o}_{i} = \hat{o}_i$. (b) The SDE generator and MLP critic of the latent SDE. In the SDE solver, the learnt drift function $f_{\bar{\theta}_\mu}$ from the ODE decoder in (a) is reused, and the diffusion function $g_{\theta_\sigma}$ is trained to capture the full stochasticity in the latent space. In both panels, deterministic variables are drawn as diamonds and stochastic variables as circles.
  • Figure 2: Overview of the architecture for training and deploying, in the target domain, a policy adapted from the source domain. (a) Given an off-the-shelf source policy $\pi^\text{src}$ and transition model $\mathcal{T}_\theta^\text{src}$, we use an inverse dynamics model $I_\eta$ to generate the target action $a_t^\text{tge}$ that leads to the next target state $\hat{s}_{t+1}^\text{tge}$. $I_\eta$ is trained by minimizing the mismatch between the predicted next state $\hat{s}_{t+1}^\text{tge}$ and the desired state $\hat{s}_{t+1}^\text{src}$. (b) The trained inverse model, combined with the source policy and transition model, is deployed in the real target environment as an adapted target policy.
  • Figure 3: Comparison between the neural ODE and SDE on stochastic cartpole. (a) Histograms of cart velocity predicted by the neural ODE and SDE at 16%, 37%, 58%, 79%, and 100% of the total timesteps, against the real histograms. (b) 50 sample paths from the distributions predicted by the neural ODE and SDE against 50 paths from the real distributions. The neural SDE learns to match the real distributions and sample paths better than the neural ODE. (c) Policies trained in the approximate transition dynamics of the neural ODE (N-ODE-based) and SDE (N-SDE-based), as well as in the real transition dynamics (model-free), are evaluated in the real environment. The N-SDE-based policy achieves performance similar to that of the model-free policy (used as the oracle here).
  • Figure 4: Evaluated performance of different policies in target cartpole environments with increasing pole lengths (source environment pole length = 1.0). Shaded regions denote the standard deviation of returns over 4 runs. (a) Under deterministic source and target dynamics, for pole lengths between 1.8 and 3.2, both the N-ODE–adapted and ensemble-adapted policies consistently outperform the original non-adapted policy, and both substantially surpass the trained-from-scratch policy. The N-ODE–adapted and ensemble-adapted policies show nearly identical performance. (b) Under stochastic source and target dynamics, obtained by adding zero-mean Gaussian noise to cart velocity, for pole lengths between 2.6 and 3.8, the ODE-adapted, SDE-drift–adapted, and ensemble-adapted policies achieve similarly strong performance, with the ensemble-adapted policy showing occasional drops at certain pole lengths. These are followed by the non-adapted policy, with the trained-from-scratch policy performing worst.
  • Figure 5: Learning curves of model-based and model-free policies in fully and partially observable stochastic MuJoCo environments. Shaded regions indicate the standard deviation of evaluated returns over 5 runs, with evaluations conducted every 5k environment steps. The policy derived from the latent SDE model consistently achieves the best asymptotic performance across all environments. SDE-based policies also exhibit greater sample efficiency than the model-free baseline in more complex environments.
  • ...and 4 more figures
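
The adaptation scheme summarized in the Figure 2 caption can be sketched on a one-dimensional toy problem. Everything below is an assumption made for illustration: the source policy, the source and target transition functions, the linear one-parameter inverse model, and the hand-derived gradient are hypothetical and are not the paper's models. The sketch only mirrors the stated objective, namely choosing the target action so that the predicted next target state matches the next state the source policy would reach under the source dynamics.

```python
import numpy as np

def source_policy(s):
    # Hypothetical pi^src: drive the state to zero.
    return -s

def source_model(s, a):
    # Hypothetical T^src with unit action gain.
    return s + a

def target_model(s, a):
    # Hypothetical target dynamics: the same action has half the effect.
    return s + 0.5 * a

def inverse_model(eta, s, s_desired):
    # Hypothetical one-parameter I_eta: action proportional to the desired change.
    return eta * (s_desired - s)

def train_inverse_model(steps=500, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    eta = 1.0
    for _ in range(steps):
        s = rng.uniform(-1.0, 1.0, size=64)
        s_desired = source_model(s, source_policy(s))  # desired next state
        a_target = inverse_model(eta, s, s_desired)    # proposed target action
        s_pred = target_model(s, a_target)             # predicted next target state
        err = s_pred - s_desired
        # Gradient of mean(err**2) w.r.t. eta, derived by hand for this toy model:
        # d(s_pred)/d(eta) = 0.5 * (s_desired - s).
        grad = np.mean(2.0 * err * 0.5 * (s_desired - s))
        eta -= lr * grad
    return eta

eta = train_inverse_model()

def adapted_policy(s):
    # Source policy + source model + trained inverse model, as in panel (b).
    return inverse_model(eta, s, source_model(s, source_policy(s)))
```

In this toy setting the inverse model learns to double the commanded change (`eta` approaches 2), compensating for the halved action gain, so `target_model(s, adapted_policy(s))` lands close to the state the source policy would have reached in the source domain. With neural network models the gradient would be obtained by automatic differentiation rather than by hand.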