Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning

Avik Kar, Rahul Singh

TL;DR

This work tackles infinite-horizon average-reward RL in continuous state-action Lipschitz MDPs by introducing policy zooming, a framework that adaptively discretizes the policy space using a new policy-focused zooming dimension $d^Φ_z$ to capture problem complexity. It presents two scalable algorithms, PZRL-MF (model-free) and PZRL-MB (model-based), achieving high-probability regret bounds that scale with an effective dimension $d_{\text{eff}}$, where $d_{\text{eff}} = d^Φ_z + 2$ for the model-free case and $d_{\text{eff}} = 2d_\mathcal{S} + d^Φ_z + 3$ for the model-based case. Under finite-dimensional parameterizations or bi-Lipschitz average-reward assumptions, these bounds improve to $\tilde{O}(\sqrt{T})$, highlighting substantial adaptivity gains when the comparator policy class is simple. A novel sensitivity result bounds the distance between stationary distributions under policy changes, strengthening the theoretical foundation for policy-class regret in continuous spaces. Empirical validation on transmission scheduling and continuous RiverSwim demonstrates practical benefits of the adaptive zooming approach over uniform discretization and existing baselines, confirming the framework’s relevance for real-world continuous RL tasks.
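To make the dependence on the effective dimension concrete, the following minimal sketch evaluates the exponent of $T$ in a regret bound of the form $\tilde{\mathcal{O}}\big(T^{1 - 1/d_{\text{eff}}}\big)$, which is the form stated in the abstract. The function names are hypothetical and chosen only for illustration; the formulas for $d_{\text{eff}}$ are the two quoted above.

```python
def d_eff_model_free(zooming_dim: float) -> float:
    """Effective dimension for the model-free case: d_z + 2."""
    return zooming_dim + 2


def d_eff_model_based(zooming_dim: float, state_dim: int) -> float:
    """Effective dimension for the model-based case: 2 * d_S + d_z + 3."""
    return 2 * state_dim + zooming_dim + 3


def regret_exponent(d_eff: float) -> float:
    """Exponent of T in a bound of the form T^(1 - 1/d_eff)."""
    return 1 - 1 / d_eff


# A zooming dimension of 0 in the model-free case gives d_eff = 2,
# i.e. an exponent of 1/2 (square-root regret).
print(regret_exponent(d_eff_model_free(0)))  # 0.5

# Illustrative values only (not taken from the paper): d_z = 1, d_S = 2.
print(regret_exponent(d_eff_model_based(1, 2)))  # 0.875
```

A smaller zooming dimension pushes the exponent toward $1/2$, which is the sense in which a low-complexity policy class yields lower regret.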

Abstract

We study infinite-horizon average-reward reinforcement learning (RL) for continuous-space Lipschitz MDPs in which an agent can play policies from a given set $Φ$. The proposed algorithms efficiently explore the policy space by "zooming" into the "promising regions" of $Φ$, thereby achieving adaptivity gains in performance. We upper bound their regret as $\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = d^Φ_z+2$ for the model-free algorithm $\textit{PZRL-MF}$ and $d_{\text{eff.}} = 2d_\mathcal{S} + d^Φ_z + 3$ for the model-based algorithm $\textit{PZRL-MB}$. Here, $d_\mathcal{S}$ is the dimension of the state space, and $d^Φ_z$ is the zooming dimension given a set of policies $Φ$. The zooming dimension $d^Φ_z$ is an alternative measure of the complexity of the problem, and it depends on the underlying MDP as well as on $Φ$. Hence, the proposed algorithms exhibit low regret when the problem instance is benign and/or the agent competes against a low-complexity $Φ$ (one with a small $d^Φ_z$). When specialized to the case of a finite-dimensional policy space, we obtain that $d_{\text{eff.}}$ scales as the dimension of this space under mild technical conditions; we also obtain $d_{\text{eff.}} = 2$, or equivalently $\tilde{\mathcal{O}}(\sqrt{T})$ regret for $\textit{PZRL-MF}$, under a curvature condition on the average reward function that is commonly used in the multi-armed bandit (MAB) literature.
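The idea of zooming into promising regions can be illustrated with a much simpler analogue. The sketch below is not PZRL-MF or PZRL-MB; it is a toy version of the zooming rule from the Lipschitz bandit literature, applied to a one-dimensional parameter in $[0, 1]$ that stands in for a policy. The reward function, noise level, confidence radius, and candidate grid are all assumptions made for illustration.

```python
import math
import random


def zooming_bandit(mean_reward, horizon, grid_size=101, noise_std=0.1, seed=0):
    """Toy zooming rule on [0, 1] (illustrative analogue, not the paper's algorithm).

    mean_reward: hypothetical function mapping a point in [0, 1] to its mean reward.
    Returns the activated points and how often each one was played.
    """
    rng = random.Random(seed)
    grid = [i / (grid_size - 1) for i in range(grid_size)]  # candidate points
    arms, counts, means = [], [], []

    def radius(n):
        # Confidence radius; it shrinks as a point is played more often.
        return math.sqrt(2.0 * math.log(horizon) / (n + 1))

    for _ in range(horizon):
        # Activation rule: activate one candidate not covered by any active ball.
        for x in grid:
            if not any(abs(x - a) <= radius(n) for a, n in zip(arms, counts)):
                arms.append(x)
                counts.append(0)
                means.append(0.0)
                break

        # Selection rule: play the active point with the largest optimistic index.
        i = max(range(len(arms)), key=lambda j: means[j] + 2.0 * radius(counts[j]))
        reward = mean_reward(arms[i]) + rng.gauss(0.0, noise_std)

        # Incremental update of the empirical mean of the played point.
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]

    return arms, counts


if __name__ == "__main__":
    # Hypothetical 1-Lipschitz reward with its maximum at 0.7.
    arms, counts = zooming_bandit(lambda x: 1.0 - abs(x - 0.7), horizon=2000)
    for a, n in sorted(zip(arms, counts)):
        print(f"point {a:.2f} played {n} times")
```

Because a ball shrinks only when its center is played often, and a point is played often only when its optimistic index is high, new points tend to be activated near well-performing ones. A uniform discretization, by contrast, spends the same resolution everywhere.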

Paper Structure (32 sections, 41 theorems, 200 equations, 3 figures, 2 algorithms)


Key Result

Theorem 4.1

Let the MDP $\mathcal{M}$ satisfy Assumptions assum:lip and assum:unif_ergodic. (i) Then, the infinite-horizon average reward is $L_{J,\infty}$-Lipschitz w.r.t. the metric $\rho_{\Phi,\infty}$, i.e., for $\phi_1, \phi_2 \in \Phi$ we have $|J(\phi_1) - J(\phi_2)| \leq L_{J,\infty}\, \rho_{\Phi,\infty}(\phi_1, \phi_2)$. (ii) Furthermore, if $\mu^{(\infty)}_{\phi,p}(\xi) \leq \kappa \nu(\xi),~\forall \xi \in \mathcal{B}_\mathcal{S}, \phi \in \Phi$, for some probability ...

Figures (3)

  • Figure 1: Relations among families of continuous-space RL problems. LQR stands for Linear Quadratic Regulator (abbasi2011regret). Our assumptions correspond to the green set. The diagram is taken from maran2024no; see that reference for more details.
  • Figure 2: Policies activated by different algorithms along a single trajectory of the transmission scheduling example (see Section \ref{sec:sim}). The radius of the ball around each active policy is proportional to its average reward. Uniform discretization-based algorithms waste resources learning a larger number of policies, whereas adaptive algorithms activate more policies in the near-optimal regions.

Theorems & Definitions (85)

  • Remark 2.3
  • Remark 3.1
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.5
  • Proof sketch
  • Remark 4.6: Discontinuous Policies
  • Remark 4.7: Regarding Assumption \ref{assum:stn_dist}
  • Corollary 4.8: Finite parameterization
  • Corollary 4.9: Bi-Lipschitz MDPs
  • ...and 75 more