Sample-efficient and Scalable Exploration in Continuous-Time RL
Klemens Iten, Lenart Treven, Bhavya Sukhija, Florian Dörfler, Andreas Krause
TL;DR
The paper addresses learning in unknown continuous-time dynamical systems by introducing COMBRL, a probabilistic, optimistic model-based RL framework that uses epistemic uncertainty to drive exploration directly in continuous time. A single scalar λ_n balances extrinsic reward against model uncertainty. With it, COMBRL achieves sublinear regret in reward-driven tasks and admits sample-complexity bounds for unsupervised dynamics learning, while remaining agnostic to the underlying uncertainty model (GPs, BNNs, ensembles). Empirical evaluations on continuous-time benchmarks complement the theory, showing improved sample efficiency and scalability over baselines and state-of-the-art continuous-time methods, including in time-adaptive TaCoS scenarios. The approach generalizes to unseen tasks and supports both task-directed and broad global exploration, which makes it relevant for real-world continuous-time control and system identification.
Abstract
Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better, is more sample-efficient than prior methods, and outperforms baselines across several deep RL tasks.
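To make the weighted-sum objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small ensemble of candidate ODE right-hand sides as a stand-in for a GP or BNN, uses ensemble disagreement as the epistemic uncertainty, approximates the continuous-time integral with a forward Euler rule, and picks the best policy from a finite candidate set. All names (`ensemble_predict`, `optimistic_return`, `select_policy`, `lam`) are hypothetical.

```python
import numpy as np

def ensemble_predict(models, x, u):
    """Mean and std of dx/dt across ensemble members (assumed uncertainty proxy)."""
    preds = np.stack([f(x, u) for f in models])
    return preds.mean(axis=0), preds.std(axis=0)

def optimistic_return(models, policy, reward, x0, lam, horizon=1.0, dt=0.01):
    """Euler approximation of the integral of reward + lam * ||epistemic std||."""
    x = np.asarray(x0, dtype=float)
    total = 0.0
    for _ in range(int(round(horizon / dt))):
        u = policy(x)
        mean, std = ensemble_predict(models, x, u)
        # Weighted sum of extrinsic reward and model uncertainty.
        total += (reward(x, u) + lam * np.linalg.norm(std)) * dt
        # Roll the state forward under the mean dynamics (assumption).
        x = x + mean * dt
    return total

def select_policy(models, candidates, reward, x0, lam):
    """Greedy choice over a finite, illustrative set of candidate policies."""
    scores = {name: optimistic_return(models, pi, reward, x0, lam)
              for name, pi in candidates.items()}
    return max(scores, key=scores.get), scores

# Toy 1-D system dx/dt = a * x + u, with ensemble members disagreeing on a.
models = [lambda x, u, a=a: a * x + u for a in (-1.2, -1.0, -0.8)]
reward = lambda x, u: -float(x @ x) - 0.1 * float(u @ u)
candidates = {k: (lambda x, k=k: -k * x) for k in (0.0, 0.5, 2.0)}

best_gain, scores = select_policy(models, candidates, reward,
                                  x0=[1.0], lam=1.0)
```

Setting `lam = 0` reduces the score to the plain reward integral, while dropping the reward term leaves a purely uncertainty-driven objective, mirroring the reward-driven and unsupervised settings described in the abstract.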
