Computationally Efficient RL under Linear Bellman Completeness for Deterministic Dynamics
Runzhe Wu, Ayush Sekhari, Akshay Krishnamurthy, Wen Sun
TL;DR
This work addresses computationally efficient online reinforcement learning under linear Bellman completeness with deterministic dynamics. It introduces a randomized least-squares approach where exploration noise is confined to the null space of the data and paired with a span-based analysis to bound regret, enabling learning in large action spaces and under stochastic rewards/initial states. The authors prove regret bounds of the form $\tilde{O}(d^{5/2}H^{5/2} + d^2H^{3/2}\sqrt{T})$ under exact or approximate square-loss oracles and extend to scenarios with low inherent Bellman error, detailing oracle-implementation strategies via convex-set feasibility and linear optimization. This work narrows the statistical-computational gap for linear Bellman complete RL and furnishes practical algorithms grounded in convex optimization with provable guarantees, while leaving extensions to stochastic dynamics as an open problem.
Abstract
We study computationally and statistically efficient Reinforcement Learning algorithms for the linear Bellman complete setting. This setting uses linear function approximation to capture value functions and unifies existing models like linear Markov Decision Processes (MDPs) and Linear Quadratic Regulators (LQR). While it is known from prior work that this setting is statistically tractable, it remained open whether a computationally efficient algorithm exists. Our work provides a computationally efficient algorithm for the linear Bellman complete setting that works for MDPs with large action spaces, random initial states, and random rewards, but requires the underlying dynamics to be deterministic. Our approach is based on randomization: we inject random noise into least squares regression problems to perform optimistic value iteration. Our key technical contribution is to carefully design the noise to act only in the null space of the training data, which ensures optimism while circumventing a subtle error amplification issue.
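
To make the null-space idea concrete, the following is a minimal sketch of a least-squares fit perturbed only along directions the training data does not span. It illustrates the general principle and is not the paper's algorithm: the function name `null_space_perturbed_lstsq`, the Gaussian noise, the scale `sigma`, and the rank tolerance `tol` are all assumptions made for this example.

```python
import numpy as np

def null_space_perturbed_lstsq(X, y, sigma=1.0, rng=None, tol=1e-10):
    """Least-squares fit plus noise restricted to the null space of X.

    Hypothetical helper for illustration; the noise distribution and
    scale are assumptions, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Minimum-norm least-squares solution on the observed data.
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Full SVD: the first `rank` rows of Vt span the row space of X,
    # the remaining rows span its null space.
    _, s, Vt = np.linalg.svd(X, full_matrices=True)
    rank = int(np.sum(s > tol * max(s.max(), 1.0)))
    N = Vt[rank:].T  # shape (d, d - rank), orthonormal columns

    # Random perturbation living only in the null space, so that
    # X @ noise is (numerically) zero.
    noise = N @ (sigma * rng.standard_normal(N.shape[1]))
    return theta_hat + noise, N

# Two data points that only cover the first two of three features.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y = np.array([2.0, 3.0])
theta, N = null_space_perturbed_lstsq(X, y, sigma=5.0,
                                      rng=np.random.default_rng(0))
```

In this sketch the perturbed parameter still reproduces the fit on the observed data exactly, since `X @ noise` vanishes; only predictions along unexplored feature directions are randomized.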
