Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning
Avik Kar, Rahul Singh
TL;DR
This work tackles infinite-horizon average-reward RL in continuous state-action Lipschitz MDPs by introducing policy zooming, a framework that adaptively discretizes the policy space using a new policy-focused zooming dimension $d^Φ_z$ to capture problem complexity. It presents two scalable algorithms, PZRL-MF (model-free) and PZRL-MB (model-based), achieving high-probability regret bounds that scale with an effective dimension $d_{\text{eff}}$, where $d_{\text{eff}} = d^Φ_z + 2$ for the model-free case and $d_{\text{eff}} = 2d_\mathcal{S} + d^Φ_z + 3$ for the model-based case. Under finite-dimensional parameterizations or bi-Lipschitz average-reward assumptions, these bounds improve to $\tilde{O}(\sqrt{T})$, highlighting substantial adaptivity gains when the comparator policy class is simple. A novel sensitivity result bounds the distance between stationary distributions under policy changes, strengthening the theoretical foundation for policy-class regret in continuous spaces. Empirical validation on transmission scheduling and continuous RiverSwim demonstrates practical benefits of the adaptive zooming approach over uniform discretization and existing baselines, confirming the framework’s relevance for real-world continuous RL tasks.
Abstract
We study infinite-horizon average-reward reinforcement learning (RL) for continuous-space Lipschitz MDPs in which an agent can play policies from a given set $Φ$. The proposed algorithms efficiently explore the policy space by "zooming" into the "promising regions" of $Φ$, thereby achieving adaptivity gains in performance. We upper bound their regret as $\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = d^Φ_z+2$ for the model-free algorithm $\textit{PZRL-MF}$ and $d_{\text{eff.}} = 2d_\mathcal{S} + d^Φ_z + 3$ for the model-based algorithm $\textit{PZRL-MB}$. Here, $d_\mathcal{S}$ is the dimension of the state space, and $d^Φ_z$ is the zooming dimension given a set of policies $Φ$. The quantity $d^Φ_z$ is an alternative measure of the complexity of the problem, and it depends on the underlying MDP as well as on $Φ$. Hence, the proposed algorithms exhibit low regret when the problem instance is benign and/or the agent competes against a low-complexity $Φ$ (one that has a small $d^Φ_z$). When specialized to the case of a finite-dimensional policy space, we obtain that $d_{\text{eff.}}$ scales as the dimension of this space under mild technical conditions. We also obtain $d_{\text{eff.}} = 2$, or equivalently $\tilde{\mathcal{O}}(\sqrt{T})$ regret, for $\textit{PZRL-MF}$ under a curvature condition on the average reward function that is commonly used in the multi-armed bandit (MAB) literature.
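To make the dependence of the regret rate on $d_{\text{eff.}}$ concrete, the following minimal Python sketch evaluates the two formulas stated above and the resulting exponent $1 - d_{\text{eff.}}^{-1}$ of $T$. The function names `effective_dimension` and `regret_exponent` are hypothetical helpers introduced only for illustration; they are not part of the algorithms, and the sketch ignores the constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$.

```python
def effective_dimension(d_z, d_s=None, model_based=False):
    """Effective dimension from the abstract's formulas (illustrative helper).

    d_z: zooming dimension of the policy set.
    d_s: state-space dimension (needed only for the model-based formula).
    """
    if model_based:
        if d_s is None:
            raise ValueError("d_s is required for the model-based formula")
        return 2 * d_s + d_z + 3  # PZRL-MB: 2*d_S + d_z + 3
    return d_z + 2                # PZRL-MF: d_z + 2


def regret_exponent(d_eff):
    """Exponent of T in the bound T^(1 - 1/d_eff), log factors ignored."""
    return 1.0 - 1.0 / d_eff


# Model-free case with zooming dimension 0: d_eff = 2, exponent 1/2.
mf_exponent = regret_exponent(effective_dimension(0))

# Model-based case with d_S = 1 and zooming dimension 1: d_eff = 6.
mb_exponent = regret_exponent(effective_dimension(1, d_s=1, model_based=True))
```

Under these formulas, a zooming dimension of 0 in the model-free case gives $d_{\text{eff.}} = 2$ and exponent $1/2$, which matches the $\tilde{\mathcal{O}}(\sqrt{T})$ rate quoted for $\textit{PZRL-MF}$; larger $d^Φ_z$ or $d_\mathcal{S}$ pushes the exponent toward 1.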
