Sample-efficient and Scalable Exploration in Continuous-Time RL

Klemens Iten, Lenart Treven, Bhavya Sukhija, Florian Dörfler, Andreas Krause

TL;DR

The paper addresses learning in unknown continuous-time dynamical systems by introducing COMBRL, a probabilistic, optimistic model-based RL framework that uses epistemic uncertainty to drive exploration directly in continuous time. By balancing extrinsic reward against model uncertainty with a single scalar λ_n, COMBRL achieves sublinear regret in reward-driven tasks and admits a sample-complexity bound for unsupervised dynamics learning, while remaining agnostic to the underlying uncertainty model (GPs, BNNs, ensembles). The theoretical results are complemented by empirical evaluations on continuous-time benchmarks, which show improved sample efficiency and scalability compared to baselines and state-of-the-art continuous-time methods, including in time-adaptive TaCoS scenarios. The approach generalizes to unseen tasks and supports both task-directed exploration and broad global exploration, which makes it relevant for real-world continuous-time control and system identification.

Abstract

Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better, is more sample-efficient than prior methods, and outperforms baselines across several deep RL tasks.
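To make the "weighted sum of the extrinsic reward and model epistemic uncertainty" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a small ensemble of candidate ODE models whose disagreement stands in for epistemic uncertainty, a forward-Euler discretization of the mean dynamics, and plain random shooting in place of a planner such as iCEM. All names (`ensemble_predict`, `optimistic_return`, `plan`, `lam`) and the toy 1-D system are hypothetical.

```python
import math
import random


def ensemble_predict(models, x, u):
    """Per-dimension mean and std of dx/dt across ensemble members.

    The std is used here as a stand-in for epistemic uncertainty (assumption).
    """
    preds = [f(x, u) for f in models]
    n, dim = len(preds), len(x)
    mean = [sum(p[i] for p in preds) / n for i in range(dim)]
    std = [math.sqrt(sum((p[i] - mean[i]) ** 2 for p in preds) / n)
           for i in range(dim)]
    return mean, std


def optimistic_return(models, reward, x0, controls, dt, lam):
    """Approximate the integral of r(x, u) + lam * ||sigma(x, u)|| over time.

    The state is rolled out along the ensemble-mean dynamics with
    forward Euler (a discretization chosen only for illustration).
    """
    x = list(x0)
    total = 0.0
    for u in controls:
        mean, std = ensemble_predict(models, x, u)
        bonus = math.sqrt(sum(s * s for s in std))  # norm of the uncertainty
        total += (reward(x, u) + lam * bonus) * dt
        x = [xi + dt * mi for xi, mi in zip(x, mean)]
    return total


def plan(models, reward, x0, horizon, dt, lam, n_candidates=64, seed=0):
    """Random-shooting planner: keep the best of n_candidates control sequences."""
    rng = random.Random(seed)
    best_controls, best_value = None, -math.inf
    for _ in range(n_candidates):
        controls = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        value = optimistic_return(models, reward, x0, controls, dt, lam)
        if value > best_value:
            best_controls, best_value = controls, value
    return best_controls, best_value


# Toy 1-D system dx/dt = a * x + u with three hypothetical guesses for a.
models = [lambda x, u, a=a: [a * x[0] + u] for a in (-1.0, -0.8, -1.2)]
reward = lambda x, u: -(x[0] ** 2) - 0.1 * u ** 2

controls, value = plan(models, reward, [1.0], horizon=5, dt=0.1, lam=1.0)
```

In this sketch, `lam = 0` reduces the objective to the extrinsic reward evaluated on the mean model, while a very large `lam` makes the uncertainty bonus dominate, which mirrors the reward-driven and unsupervised regimes described above.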

Paper Structure

This paper contains 34 sections, 5 theorems, 44 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Under the regularity assumptions (up to and including the well-calibration assumption), we have, with probability at least $1 - \delta$:

Figures (6)

  • Figure 1: GP dynamics. Learning curves for baselines, COMBRL and OCORL with fixed internal reward weight $\lambda_n$ using GP dynamics and iCEM planning, averaged over 5 seeds. We report the mean and the standard error bands. COMBRL achieves higher asymptotic returns than PETS and mean, while matching or exceeding OCORL at roughly $3\times$ lower computational cost (see the appendix on compute times).
  • Figure 2: Effect of intrinsic rewards. Learning curves for COMBRL (with auto-tuned $\lambda_n$) and baselines. We report the mean return when evaluating the learned model on the task at hand, averaged across 10 random seeds along with the standard error. COMBRL achieves the largest performance gains in sparse or underactuated tasks, and consistent improvements in higher-dimensional domains.
  • Figure 3: Generalization to downstream tasks. Final returns at convergence on primary (trained) and downstream (unseen) tasks across seven Gym environments. We report the mean return as well as the standard error for the primary and downstream task. For COMBRL, we differentiate between the balanced case with a static or annealing schedule for $\lambda_n$, or the unsupervised case with $\lambda_n\rightarrow\infty$.
  • Figure 4: Ablating the internal reward weight $\lambda_n$. Final performance at convergence for different environments and tasks with varying $\lambda$. We ablate over different choices of $\lambda$ and report the mean return and standard error on a primary task which the proposed algorithm was trained on, as well as a previously unseen downstream task.
  • Figure 5: Adaptive TaCoS setting. Final performance at convergence for COMBRL-TaCoS compared to OTaCoS, Mean-TaCoS, PETS-TaCoS, and COMBRL with a fixed control rate (equidistant MSS). Final returns at convergence are reported as the mean with standard error over $10$ random seeds. COMBRL-TaCoS achieves competitive or superior returns while requiring fewer interactions than its fixed-rate variant, and matches or exceeds the performance of OTaCoS.
  • ...and 1 more figure

Theorems & Definitions (12)

  • Definition 1: Measurement selection strategy, treven_efficient_2023
  • Definition 2: Well-calibrated statistical model of ${\bm{f}}^*$, rothfuss_hallucinated_2023
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • ...and 2 more