
Model-Based Reinforcement Learning for Control under Time-Varying Dynamics

Klemens Iten, Bruce Lee, Chenhao Li, Lenart Treven, Andreas Krause, Bhavya Sukhija

Abstract

Learning-based control methods typically assume stationary system dynamics, an assumption often violated in real-world systems due to drift, wear, or changing operating conditions. We study reinforcement learning for control under time-varying dynamics. We consider a continual model-based reinforcement learning setting in which an agent repeatedly learns and controls a dynamical system whose transition dynamics evolve across episodes. We analyze the problem using Gaussian process dynamics models under frequentist variation-budget assumptions. Our analysis shows that persistent non-stationarity requires explicitly limiting the influence of outdated data to maintain calibrated uncertainty and meaningful dynamic regret guarantees. Motivated by these insights, we propose a practical optimistic model-based reinforcement learning algorithm with adaptive data buffer mechanisms and demonstrate improved performance on continuous control benchmarks with non-stationary dynamics.
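To make the setting concrete, below is a minimal, hypothetical sketch of the continual learning loop described above, using a sliding window over episodes as the adaptive data buffer. The names `fit_dynamics_model`, `plan_optimistic_policy`, and `rollout` are placeholders standing in for the GP dynamics model, the optimistic planner, and environment interaction; this is an illustration of the setting, not the paper's implementation.

```python
from collections import deque

def continual_mbrl(env, num_episodes, window_size,
                   fit_dynamics_model, plan_optimistic_policy, rollout):
    """Continual model-based RL loop with a sliding-window episode buffer (sketch)."""
    buffer = deque(maxlen=window_size)  # keeps only the w most recent episodes
    policy = None
    for n in range(num_episodes):
        # Only transitions still inside the window influence the model fit,
        # so outdated data cannot bias the estimate under changing dynamics.
        data = [t for episode in buffer for t in episode]
        model = fit_dynamics_model(data)            # e.g. GP fit on windowed data only
        policy = plan_optimistic_policy(model)      # optimistic w.r.t. model uncertainty
        episode = rollout(env, policy)              # list of (state, action, next_state)
        buffer.append(episode)                      # oldest episode drops out automatically
    return policy
```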


Paper Structure

This paper contains 41 sections, 4 theorems, 58 equations, 5 figures, 2 tables, 2 algorithms.

Key Result

Lemma 1

Assume the dynamics satisfy the RKHS regularity assumptions of the problem setting. Consider the GP mean and variance estimates $(\mu_{m:n-1,j}, \sigma_{m:n-1,j})$ obtained with either the full resetting mechanism $m=n_0(n)$ or the sliding window $m=n-w$. Then, with probability at least $1-\delta$, for any episode the prediction error of the GP mean is bounded by the posterior standard deviation scaled by a confidence parameter, plus a drift term whose coefficient accounts for how much the dynamics have changed over the episodes retained in the buffer.
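As an illustration of the restricted estimates $(\mu_{m:n-1,j}, \sigma_{m:n-1,j})$, the sketch below computes a GP posterior mean and standard deviation from only the data retained in the buffer (episodes $m$ through $n-1$; under the sliding window, $m = n - w$). The squared-exponential kernel and the hyperparameter values are illustrative assumptions, not the paper's model.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential kernel between row-wise inputs A and B (illustrative choice)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X_win, y_win, X_query, noise_var=0.01):
    """GP posterior mean/std using only the data retained in the buffer.

    X_win, y_win: state-action inputs and next-state targets from episodes
    m..n-1 only; older transitions were discarded by the reset or
    sliding-window rule, so they cannot bias the estimate.
    """
    K = rbf_kernel(X_win, X_win) + noise_var * np.eye(len(X_win))
    K_star = rbf_kernel(X_query, X_win)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_win))   # K^{-1} y via Cholesky
    mu = K_star @ alpha
    V = np.linalg.solve(L, K_star.T)
    var = rbf_kernel(X_query, X_query).diagonal() - (V ** 2).sum(axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

# Example: predict one output dimension at query points from windowed data.
X_win = np.random.default_rng(0).normal(size=(50, 3))  # (state, action) features
y_win = np.sin(X_win[:, 0])                             # one dynamics output dimension
mu, sigma = gp_posterior(X_win, y_win, X_win[:5])
```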

Figures (5)

  • Figure E1: Learning curves for the setting with GP dynamics on the Pendulum environment. We report the mean cumulative regret $R_N$ over 5 random seeds. At episode $N=10$, we induce a change in dynamics by gradually reducing the maximum applicable action ${\bm{u}}_{N,t}$ to half its original value. This leads to linear regret for the stationary SOMBRL baseline, while R-OMBRL and SW-OMBRL adapt to the change in dynamics.
  • Figure G1: Dynamic regret under time-varying dynamics for different decay rates. The stationary baseline SOMBRL accumulates large regret due to stale data, while R-OMBRL and SW-OMBRL improve tracking by restricting the data buffer. We report the mean regret relative to an estimate of the optimal performance, averaged over five seeds with standard error.
  • Figure G2: Dynamic regret across multiple environments. The top row shows the evolution of the maximum admissible torque $\bar{u}_n$ and its decay over environment training steps, while the bottom rows report the cumulative regret $R_n$ averaged over five seeds with standard error. Under stationary dynamics, all methods perform similarly. After the onset of non-stationarity, R-OMBRL and SW-OMBRL significantly reduce regret compared to the stationary baseline.
  • Figure G3: Hardware experiments on a real RC car. The task is a parking maneuver that transitions from drift-based behavior to standard parking as the maximum throttle decays at $N=14$ episodes. The top row shows rollouts after $N=30$, where R-OMBRL successfully completes the task while SOMBRL fails to adapt. The bottom row shows the mean return and error bands along trajectories ${\bm{\tau}}_N$ on the real system over episodes, averaged over 6 random seeds. Restricting the data buffer enables adaptation to changing dynamics and improves performance compared to the stationary baseline.
  • Figure K1: Ablation of the soft reset parameter $\alpha_1=\alpha_2=\alpha$ from \ref{eq:soft_reset} for R-OMBRL under non-stationary dynamics. We vary the strength of the parameter perturbations applied to the dynamics model and policy after replay-buffer resets. Moderate perturbations yield the best trade-off between stability and plasticity, while aggressive policy perturbations are particularly detrimental (a sketch of one possible soft-reset rule follows this list).
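The exact soft-reset update of \ref{eq:soft_reset} is not reproduced here. One plausible instantiation, consistent with the description of parameter perturbations after replay-buffer resets, is a shrink-and-perturb rule that interpolates the current parameters toward a fresh initialization with strength $\alpha$; the sketch below assumes this form for illustration only.

```python
import numpy as np

def soft_reset(params, fresh_init, alpha):
    """Shrink-and-perturb style soft reset applied after a replay-buffer reset.

    Interpolates current parameters toward a freshly initialized copy:
    alpha = 0 keeps the old parameters, alpha = 1 re-initializes fully.
    This is one plausible form of the update, assumed for illustration.
    """
    return {name: (1.0 - alpha) * value + alpha * fresh_init[name]
            for name, value in params.items()}

# Figure K1 ties the model and policy strengths together (alpha_1 = alpha_2 = alpha).
model_params = {"w": np.ones(4), "b": np.zeros(2)}
fresh = {"w": np.random.default_rng(0).normal(size=4), "b": np.zeros(2)}
model_params = soft_reset(model_params, fresh, alpha=0.2)
```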

Theorems & Definitions (7)

  • Lemma 1: Lemmas 1 and 2 of zhou2019no, adapted to the vector-valued function ${\bm{f}}^*$
  • Lemma 2: Performance difference bound
  • Theorem 1: Regret bound
  • Proof of Lemma 2 (performance difference bound)
  • Lemma 3: Episodic regret bound
  • Proof
  • Proof of Theorem 1 (regret bound)