Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation
Runze Zhao, Yue Yu, Adams Yiyue Zhu, Chen Yang, Dongruo Zhou
TL;DR
This work tackles the theoretical and practical inefficiencies of continuous-time reinforcement learning (CTRL) when general function approximation is used. It introduces Policy Update and Rolling-out Efficient CTRL (PURE), a model-based CTRL framework that employs optimism-based confidence sets for the dynamics and reward and delivers the first finite-sample guarantees for CTRL with general function classes. The authors further propose two efficiency-focused variants, PURE_LowSwitch and PURE_LowRollout, that dramatically reduce policy updates and rollouts with only modest trade-offs in suboptimality, quantified in terms of the distributional Eluder dimensions $d_{\mathcal{R}}$ and $d_{\mathcal{F}}$ and a measurement-independence coefficient. The theory is complemented by experiments on diffusion-model fine-tuning and standard continuous-time control tasks, where PURE achieves competitive performance with substantially fewer updates and faster training times. Overall, PURE provides a principled approach to balancing sample efficiency and computational cost in CTRL with rich function approximation, advancing both theory and practice.
Abstract
Continuous-time reinforcement learning (CTRL) provides a principled framework for sequential decision-making in environments where interactions evolve continuously over time. Despite its empirical success, the theoretical understanding of CTRL remains limited, especially in settings with general function approximation. In this work, we propose a model-based CTRL algorithm that achieves both sample and computational efficiency. Our approach leverages optimism-based confidence sets to establish the first sample complexity guarantee for CTRL with general function approximation, showing that a near-optimal policy can be learned with a suboptimality gap of $\tilde{O}(\sqrt{d_{\mathcal{R}} + d_{\mathcal{F}}}N^{-1/2})$ using $N$ measurements, where $d_{\mathcal{R}}$ and $d_{\mathcal{F}}$ denote the distributional Eluder dimensions of the reward and dynamics functions, respectively, capturing the complexity of general function approximation in reinforcement learning. Moreover, we introduce structured policy updates and an alternative measurement strategy that significantly reduce the number of policy updates and rollouts while maintaining competitive sample efficiency. We conduct experiments on continuous control tasks and diffusion model fine-tuning to support the proposed algorithms, demonstrating comparable performance with significantly fewer policy updates and rollouts.
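
As a rough reading of the stated bound, it can be inverted to estimate how many measurements are needed to reach a target suboptimality gap $\varepsilon$. The sketch below treats the constants and logarithmic factors hidden by $\tilde{O}(\cdot)$ as a single quantity $C$; both $C$ and $\varepsilon$ are symbols introduced here for illustration only and do not appear in the abstract.

$$
C\,\sqrt{\frac{d_{\mathcal{R}} + d_{\mathcal{F}}}{N}} \le \varepsilon
\quad\Longleftrightarrow\quad
N \ge \frac{C^{2}\,(d_{\mathcal{R}} + d_{\mathcal{F}})}{\varepsilon^{2}}
$$

Under this simplification, the number of measurements grows linearly in $d_{\mathcal{R}} + d_{\mathcal{F}}$ and quadratically in $1/\varepsilon$, so halving the target gap requires roughly four times as many measurements.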
