Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback
Tal Lancewicki, Yishay Mansour
TL;DR
This work addresses online finite-horizon MDPs with adversarial, horizon-dependent losses under aggregate (full-bandit) feedback. It introduces Policy Optimization methods based on a novel $U$-function decomposition, which enables closed-form updates that use only the total trajectory loss. In the known-dynamics setting, it achieves a near-optimal regret of $R_K = \tilde{O}(H^2 \sqrt{S A K})$; in the unknown-dynamics setting, it achieves $R_K = \tilde{O}(H^3 S \sqrt{A K})$ (with a matching high-probability guarantee), improving the best previously known bound by a factor of $H^2 S^5 A^2$. A new lower bound of $\Omega(H^2 \sqrt{S A K})$ is also established. The results match or surpass prior approaches that relied on reductions to distorted linear bandits, while offering a computationally efficient algorithm based purely on policy optimization. The work also lays groundwork for extending the $U$-function approach to function-approximation settings and potentially to linear MDPs.
Abstract
We study online finite-horizon Markov Decision Processes with adversarially changing losses and aggregate bandit feedback (a.k.a. full-bandit). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first \textit{optimal} regret bound of $\tilde{\Theta}(H^2\sqrt{SAK})$, where $K$ is the number of episodes, $H$ is the episode horizon, $S$ is the number of states, and $A$ is the number of actions. In the unknown-dynamics case, we establish a regret bound of $\tilde{O}(H^3 S \sqrt{AK})$, significantly improving the best known result by a factor of $H^2 S^5 A^2$.
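To make the feedback model concrete, the following is a minimal sketch of a single episode under aggregate bandit feedback in a tabular MDP. It is an illustration only, not the paper's algorithm: the function name `run_episode`, the nested-list layout of `policy`, `transitions`, and `losses`, and the toy instance are all assumptions made for this example. The point it shows is that the learner receives the visited state-action pairs and one scalar (the summed loss), never the per-step losses.

```python
import random


def run_episode(policy, transitions, losses, horizon, start_state=0, rng=None):
    """Roll out one episode and return (trajectory, total_loss).

    Hypothetical data layout (an assumption of this sketch):
      policy[h][s]         -> list of action probabilities at step h in state s
      transitions[h][s][a] -> list of next-state probabilities
      losses[h][s][a]      -> adversarial loss for this episode, hidden from the learner
    """
    rng = rng or random.Random()
    state = start_state
    trajectory = []
    total_loss = 0.0
    for h in range(horizon):
        # Sample an action from the current policy.
        action_probs = policy[h][state]
        action = rng.choices(range(len(action_probs)), weights=action_probs)[0]
        trajectory.append((state, action))
        # The per-step loss is accumulated internally but never revealed on its own.
        total_loss += losses[h][state][action]
        # Sample the next state from the dynamics.
        next_probs = transitions[h][state][action]
        state = rng.choices(range(len(next_probs)), weights=next_probs)[0]
    # Aggregate (full-bandit) feedback: only the sum over the trajectory is observed.
    return trajectory, total_loss


# Toy instance: H = 3 steps, S = 1 state, A = 2 actions (all values are made up).
H = 3
policy = [[[0.0, 1.0]] for _ in range(H)]              # always play action 1
transitions = [[[[1.0], [1.0]]] for _ in range(H)]     # single state, stay put
losses = [[[0.0, 0.5]] for _ in range(H)]              # action 1 costs 0.5 per step

trajectory, total_loss = run_episode(policy, transitions, losses, H, rng=random.Random(0))
print(trajectory, total_loss)  # [(0, 1), (0, 1), (0, 1)] 1.5
```

In this toy run the learner sees the total 1.5 but cannot tell from a single episode how that loss was distributed across the three steps, which is what distinguishes this setting from standard bandit feedback with per-step losses.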
