Computationally Efficient RL under Linear Bellman Completeness for Deterministic Dynamics

Runzhe Wu, Ayush Sekhari, Akshay Krishnamurthy, Wen Sun

TL;DR

This work addresses computationally efficient online reinforcement learning under linear Bellman completeness with deterministic dynamics. It introduces a randomized least-squares approach where exploration noise is confined to the null space of the data and paired with a span-based analysis to bound regret, enabling learning in large action spaces and under stochastic rewards/initial states. The authors prove regret bounds of the form $\tilde{O}(d^{5/2}H^{5/2} + d^2H^{3/2}\sqrt{T})$ under exact or approximate square-loss oracles and extend to scenarios with low inherent Bellman error, detailing oracle-implementation strategies via convex-set feasibility and linear optimization. This work narrows the statistical-computational gap for linear Bellman complete RL and furnishes practical algorithms grounded in convex optimization with provable guarantees, while leaving extensions to stochastic dynamics as an open problem.

Abstract

We study computationally and statistically efficient Reinforcement Learning algorithms for the linear Bellman Complete setting. This setting uses linear function approximation to capture value functions and unifies existing models like linear Markov Decision Processes (MDP) and Linear Quadratic Regulators (LQR). While it is known from the prior works that this setting is statistically tractable, it remained open whether a computationally efficient algorithm exists. Our work provides a computationally efficient algorithm for the linear Bellman complete setting that works for MDPs with large action spaces, random initial states, and random rewards but relies on the underlying dynamics to be deterministic. Our approach is based on randomization: we inject random noise into least squares regression problems to perform optimistic value iteration. Our key technical contribution is to carefully design the noise to only act in the null space of the training data to ensure optimism while circumventing a subtle error amplification issue.
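
The null-space idea in the abstract can be illustrated with a small NumPy sketch. This is not the paper's algorithm: it only shows, under simplifying assumptions, how Gaussian noise projected onto the orthogonal complement of the observed feature vectors perturbs a least-squares solution without changing its predictions on the training data. The function names `null_space_projector` and `perturbed_least_squares`, the tolerance `tol`, and the scale `sigma` are hypothetical choices made for this illustration.

```python
import numpy as np


def null_space_projector(Phi, tol=1e-10):
    """Return the d x d orthogonal projector onto the null space of Phi.

    Phi has shape (n, d): one observed feature vector per row.
    (Hypothetical helper for illustration, not from the paper.)
    """
    # Full SVD so that Vt is (d, d); rows past the rank span the null space.
    _, s, Vt = np.linalg.svd(Phi, full_matrices=True)
    rank = int(np.sum(s > tol))
    V_null = Vt[rank:].T  # shape (d, d - rank)
    return V_null @ V_null.T


def perturbed_least_squares(Phi, y, sigma, rng):
    """Least-squares fit plus Gaussian noise restricted to the null space of Phi.

    Returns (theta_ls, theta_perturbed). The perturbation is invisible on the
    training features because Phi @ P = 0 for the projector P.
    """
    theta_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # minimum-norm solution
    P = null_space_projector(Phi)
    noise = sigma * (P @ rng.standard_normal(Phi.shape[1]))
    return theta_ls, theta_ls + noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Phi = rng.standard_normal((3, 5))  # 3 observations in d = 5 dimensions
    y = rng.standard_normal(3)
    theta_ls, theta = perturbed_least_squares(Phi, y, sigma=2.0, rng=rng)
    # Predictions on the observed features agree up to floating-point error.
    print(np.max(np.abs(Phi @ theta - Phi @ theta_ls)))
    # The parameters differ in the unexplored directions.
    print(np.linalg.norm(theta - theta_ls))
```

In this sketch the perturbed parameter only changes predictions along feature directions that have not been observed, which is one way to read the abstract's claim that the noise drives exploration while leaving the fit to the training data intact.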

Paper Structure

This paper contains 30 sections, 35 theorems, 141 equations, 1 table, 5 algorithms.

Key Result

Theorem 1

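Under the exact-oracle and deterministic-dynamics assumptions, executing the main algorithm with parameters $\sigma_{\mathrm{R}} = \widetilde{\Theta}(\sqrt{d H \log(HT)})$ and $\sigma_{h} = \widetilde{\Theta}( (d\sqrt{mH})^{H-h+1}(\sqrt{d} + \sqrt{mH}) )$, we have a regret bound of the form $\widetilde{O}(d^{5/2}H^{5/2} + d^2H^{3/2}\sqrt{T})$.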
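
To get a feel for the parameter choices in the theorem, the following sketch evaluates the two noise scales with all hidden constants and logarithmic factors inside $\widetilde{\Theta}(\cdot)$ set to 1, which is an assumption made purely for illustration. The function names are hypothetical, and `m` is the quantity appearing in the theorem statement, which this summary does not define.

```python
import math


def reward_noise_scale(d, H, T):
    """sqrt(d * H * log(H * T)), i.e. sigma_R with constants set to 1."""
    return math.sqrt(d * H * math.log(H * T))


def step_noise_scale(h, d, H, m):
    """(d * sqrt(m*H))**(H - h + 1) * (sqrt(d) + sqrt(m*H)), i.e. sigma_h
    with constants and log factors set to 1 (illustrative only)."""
    base = d * math.sqrt(m * H)
    return base ** (H - h + 1) * (math.sqrt(d) + math.sqrt(m * H))


if __name__ == "__main__":
    d, H, m, T = 10, 5, 1, 10_000  # arbitrary example values
    print("sigma_R ~", reward_noise_scale(d, H, T))
    for h in range(1, H + 1):
        print(f"sigma_{h} ~", step_noise_scale(h, d, H, m))
```

With these formulas, each step toward the start of the episode multiplies $\sigma_h$ by the factor $d\sqrt{mH}$, so the scale grows geometrically in $H-h+1$.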

Theorems & Definitions (67)

  • Definition 1: Linear Bellman Completeness
  • Example 1: Arbitrarily Large $\ell_2$-norm on Parameters
  • Example 2: Expansiveness of Bellman Backup in $\ell_2$-norm
  • Definition 2: D-optimal design
  • Theorem 1: Regret Bound with Exact Oracle
  • Corollary 1: Sample Complexity Bound
  • Theorem 2: Regret Bound with Approximate Oracle
  • Definition 3: Inherent Linear Bellman Error
  • Theorem 3: Regret Bound with Low Inherent Bellman Error
  • Definition 4: Separation oracle
  • ...and 57 more