Statistical guarantees for continuous-time policy evaluation: blessing of ellipticity and new tradeoffs

Wenlong Mou

TL;DR

The paper tackles value-function estimation for continuous-time diffusion processes using a single ergodic trajectory. It develops a non-asymptotic analysis of the LSTD estimator under a projected, discretized Bellman framework, with error measured in the first-order Sobolev norm. By exploiting ellipticity, the analysis yields an $O(1/\sqrt{T})$ convergence rate that does not deteriorate as the effective horizon grows, while revealing a nuanced trade-off between approximation error and statistical error: the Markovian portion of the statistical error scales with the approximation error, and the martingale portion grows sub-linearly with the number of basis functions. The results are illustrated via a Fourier-based example on the torus and are argued to extend to well-structured function classes, providing practical guidance on basis selection and trajectory-length requirements for continuous-time policy evaluation.
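For orientation, the "projected, discretized Bellman framework" can be written out in one common form. The notation here is an assumption made for illustration and may differ from the paper's: drift $b$, diffusion matrix $\Lambda$, reward $r$, discount rate $\beta > 0$, observation step $\eta$, and feature map $\phi$.

$$
dX_t = b(X_t)\,dt + \Lambda(X_t)^{1/2}\,dB_t,
\qquad
V^*(x) = \mathbb{E}\left[\int_0^\infty e^{-\beta t}\, r(X_t)\,dt \,\middle|\, X_0 = x\right].
$$

By the Markov property, $V^*$ satisfies a Bellman equation over one observation step:

$$
V^*(x) = \mathbb{E}\left[\int_0^\eta e^{-\beta s}\, r(X_s)\,ds + e^{-\beta \eta}\, V^*(X_\eta) \,\middle|\, X_0 = x\right].
$$

A standard LSTD estimator with a linear class $V_\theta = \theta^\top \phi$ then solves the empirical projected equation, here with the one-step reward integral approximated by $\eta\, r(X_{k\eta})$ (one possible discretization, not necessarily the paper's):

$$
\sum_{k=0}^{n-1} \phi(X_{k\eta}) \left( \phi(X_{k\eta}) - e^{-\beta \eta}\, \phi(X_{(k+1)\eta}) \right)^\top \hat{\theta}
= \sum_{k=0}^{n-1} \phi(X_{k\eta})\, \eta\, r(X_{k\eta}),
\qquad n = T/\eta .
$$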

Abstract

We study the estimation of the value function for continuous-time Markov diffusion processes using a single, discretely observed ergodic trajectory. Our work provides non-asymptotic statistical guarantees for the least-squares temporal-difference (LSTD) method, with performance measured in the first-order Sobolev norm. Specifically, the estimator attains an $O(1 / \sqrt{T})$ convergence rate when using a trajectory of length $T$; notably, this rate is achieved as long as $T$ scales nearly linearly with both the mixing time of the diffusion and the number of basis functions employed. A key insight of our approach is that the ellipticity inherent in the diffusion process ensures robust performance even as the effective horizon diverges to infinity. Moreover, we demonstrate that the Markovian component of the statistical error can be controlled by the approximation error, while the martingale component grows at a slower rate relative to the number of basis functions. By carefully balancing these two sources of error, our analysis reveals novel trade-offs between approximation and statistical errors.
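The setting described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's implementation: a one-dimensional diffusion on the torus $[0, 2\pi)$ with zero drift and constant diffusion coefficient, reward $r(x) = \cos x$, a real Fourier basis, and a textbook LSTD solve on the discretely observed path. All function names and parameter values (`fourier_features`, `simulate_torus_diffusion`, `lstd`, `h1_norm_sq`, `beta`, `sigma`, `eta`) are hypothetical choices made for this example. With these choices the continuous-time value function is $\cos(x) / (\beta + \sigma^2/2)$, which gives a reference point for the estimate.

```python
import numpy as np

def fourier_features(x, n_freq):
    """Real Fourier basis on the torus: [1, cos(kx)..., sin(kx)...], k = 1..n_freq."""
    x = np.asarray(x, dtype=float)
    kx = np.multiply.outer(x, np.arange(1, n_freq + 1))
    return np.concatenate([np.ones(x.shape + (1,)), np.cos(kx), np.sin(kx)], axis=-1)

def simulate_torus_diffusion(drift, sigma, eta, n_steps, rng):
    """Euler-Maruyama path of dX = drift(X) dt + sigma dB, wrapped to [0, 2*pi)."""
    x = np.empty(n_steps + 1)
    x[0] = rng.uniform(0.0, 2.0 * np.pi)  # uniform start (stationary when drift is zero)
    noise = sigma * np.sqrt(eta) * rng.standard_normal(n_steps)
    for i in range(n_steps):
        x[i + 1] = (x[i] + drift(x[i]) * eta + noise[i]) % (2.0 * np.pi)
    return x

def lstd(path, reward, beta, eta, n_freq):
    """Solve sum_k phi_k (phi_k - gamma * phi_{k+1})^T theta = sum_k phi_k * eta * r_k."""
    phi = fourier_features(path, n_freq)
    phi_now, phi_next = phi[:-1], phi[1:]
    gamma = np.exp(-beta * eta)  # discount factor over one observation step
    A = phi_now.T @ (phi_now - gamma * phi_next)
    b = phi_now.T @ (eta * reward(path[:-1]))
    return np.linalg.solve(A, b)

def h1_norm_sq(coef, n_freq):
    """Squared first-order Sobolev norm under the uniform measure on the torus,
    computed from Fourier coefficients via Parseval's identity."""
    k = np.arange(1, n_freq + 1)
    a, b = coef[1:n_freq + 1], coef[n_freq + 1:]
    return coef[0] ** 2 + np.sum((1.0 + k ** 2) * (a ** 2 + b ** 2)) / 2.0

rng = np.random.default_rng(0)
beta, sigma, eta, n_steps, n_freq = 1.0, 1.0, 0.05, 20000, 2  # T = n_steps * eta = 1000
path = simulate_torus_diffusion(lambda x: 0.0, sigma, eta, n_steps, rng)
theta = lstd(path, np.cos, beta, eta, n_freq)

# Reference coefficients: the continuous-time value function is c_cont * cos(x);
# the fixed point of the eta-discretized recursion used in lstd() is c_disc * cos(x).
c_cont = 1.0 / (beta + sigma ** 2 / 2.0)
c_disc = eta / (1.0 - np.exp(-(beta + sigma ** 2 / 2.0) * eta))
target = np.zeros(2 * n_freq + 1)
target[1] = c_cont  # index 1 holds the cos(x) coefficient
h1_error = np.sqrt(h1_norm_sq(theta - target, n_freq))
print("cos(x) coefficient:", theta[1], "H1 error vs continuous-time target:", h1_error)
```

In this toy case the true value function lies in the span of the basis, so the remaining error comes from the time discretization (the gap between `c_disc` and `c_cont`) and from statistical fluctuations along the single trajectory. Rerunning with longer paths or more frequencies is one way to explore the trade-offs the abstract describes, though this sketch does not reproduce the paper's bounds.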
