Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off
Zichen Zhang, Johannes Kirschner, Junxi Zhang, Francesco Zanini, Alex Ayoub, Masood Dehghan, Dale Schuurmans
TL;DR
This work tackles data-efficient policy evaluation for continuous-time systems by analyzing Monte-Carlo policy evaluation in stochastic LQR/Langevin settings. The authors derive a closed-form mean-squared error surface that decomposes into approximation (discretization) and estimation (variance) components, showing that, under a fixed data budget, finer time steps reduce approximation error but increase variance, yielding an optimal sampling step $h^*$ that scales with the budget $B$. They extend the analysis from a one-dimensional Langevin process to the multi-dimensional (vector) case and to both finite- and infinite-horizon objectives, including discounted settings, establishing scaling laws $h^*(B)\sim B^{-1/3}$ (finite horizon) and $h^*(B)\sim B^{-1/5}$ (infinite horizon). Numerical experiments on linear and nonlinear dynamical systems, including MuJoCo benchmarks, validate the theory and yield practical guidelines for choosing sampling frequencies to improve data efficiency. The results have direct implications for RL practice, suggesting that practitioners should adapt temporal resolution to the available data rather than rely on a fixed step size. Extensions to policy optimization, adaptive sampling, and broader noise models are promising directions for future work.
Abstract
A default assumption in reinforcement learning (RL) and optimal control is that observations arrive at discrete time points on a fixed clock cycle. Yet many applications involve continuous-time systems where the time discretization can, in principle, be managed. The impact of time discretization on RL methods has not been fully characterized in existing theory, but a more detailed analysis of its effect could reveal opportunities for improving data efficiency. We address this gap by analyzing Monte-Carlo policy evaluation for LQR systems and uncover a fundamental trade-off between approximation error and statistical error in value estimation. Importantly, these two errors respond differently to time discretization, leading to an optimal choice of temporal resolution for a given data budget. These findings show that managing the temporal resolution can provably improve policy evaluation efficiency in LQR systems with finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and standard RL benchmarks for non-linear continuous control.
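To make the trade-off concrete, here is a minimal sketch under a toy error model. It is not the closed-form result derived in the paper. It assumes a finite horizon of length `horizon`, a squared discretization bias of order $h^2$, and a Monte-Carlo variance inversely proportional to the number of trajectories, where each trajectory consumes `horizon / h` of the `budget` samples. The constants `c_approx` and `c_var` and the function names are hypothetical, and the number of trajectories is treated as a real number for simplicity.

```python
import numpy as np

def schematic_mse(h, budget, horizon=1.0, c_approx=1.0, c_var=1.0):
    """Toy MSE (an assumption, not the paper's formula):
    squared discretization bias plus Monte-Carlo variance."""
    # Each trajectory of length `horizon` uses horizon / h samples,
    # so a fixed budget affords budget * h / horizon trajectories.
    n_trajectories = budget * h / horizon
    return c_approx * h**2 + c_var / n_trajectories

def best_step(budget, horizon=1.0, c_approx=1.0, c_var=1.0):
    """Minimizer of schematic_mse over h > 0.

    Setting d/dh [c_approx h^2 + c_var horizon / (budget h)] = 0
    gives h^3 = c_var * horizon / (2 * c_approx * budget).
    """
    return (c_var * horizon / (2.0 * c_approx * budget)) ** (1.0 / 3.0)

# Fit the exponent of h* as a function of the budget on a log-log scale.
budgets = np.array([1e3, 1e4, 1e5, 1e6])
steps = np.array([best_step(b) for b in budgets])
slope = np.polyfit(np.log(budgets), np.log(steps), 1)[0]

# Cross-check the closed-form minimizer against a grid search at one budget.
grid = np.logspace(-4, 0, 4001)
grid_best = grid[np.argmin(schematic_mse(grid, budget=1e4))]

print(f"fitted exponent: {slope:.4f}")
print(f"closed form h*: {best_step(1e4):.5f}, grid search h*: {grid_best:.5f}")
```

In this toy model the fitted exponent is $-1/3$, which agrees with the finite-horizon scaling $h^*(B)\sim B^{-1/3}$ quoted in the TL;DR. The infinite-horizon rate involves additional error terms that this sketch does not model.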
