Online Learning in MDPs with Partially Adversarial Transitions and Losses
Ofir Schlisselberg, Tal Lancewicki, Yishay Mansour
TL;DR
The paper addresses online learning in MDPs with partially adversarial transitions, introducing conditioned occupancy measures (COM) that remain stable across episodes despite adversarial steps. It develops two COM-based algorithm families, action-based and sub-policy-based, whose regret bounds scale with the number of adversarial steps $\Lambda$: $\tilde{O}(H S^{\Lambda}\sqrt{K S A^{\Lambda+1}})$ for arbitrary adversarial steps and $\tilde{O}(H \sqrt{K S^{3} A^{\Lambda+1}})$ when the adversarial steps are consecutive. It also provides a reduction that removes the need to know which steps are adversarial, incurring a $K^{2/3}$ additive term, and characterizes regret in fully adversarial MDPs under full-information and bandit feedback, with almost matching upper and lower bounds. The results yield a refined landscape in which regret remains manageable when adversarial influence is limited to a small, fixed subset of steps, bridging the classical stationary and fully adversarial regimes, with implications for robust RL in structured non-stationary environments.
Abstract
We study reinforcement learning in MDPs whose transition function is stochastic at most steps but may behave adversarially at a fixed subset of $\Lambda$ steps per episode. This model captures environments that are stable except at a few vulnerable points. We introduce \emph{conditioned occupancy measures}, which remain stable across episodes even with adversarial transitions, and use them to design two algorithms. The first handles arbitrary adversarial steps and achieves regret $\tilde{O}(H S^{\Lambda}\sqrt{K S A^{\Lambda+1}})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $H$ is the episode horizon. The second, assuming the adversarial steps are consecutive, improves the dependence on $S$ to $\tilde{O}(H\sqrt{K S^{3} A^{\Lambda+1}})$. We further give a $K^{2/3}$-regret reduction that removes the need to know which steps are the $\Lambda$ adversarial steps. We also characterize the regret of adversarial MDPs in the \emph{fully adversarial} setting ($\Lambda=H-1$) for both full-information and bandit feedback, and provide almost matching upper and lower bounds, which slightly strengthen existing lower bounds and clarify how different feedback structures affect the hardness of learning.
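
To make the gap between the two stated regret bounds concrete, the following is a minimal sketch that evaluates their leading terms numerically. It assumes that the logarithmic factors and constants hidden by $\tilde{O}$ can be ignored, so the numbers indicate scaling only and are not actual regret values. The function names and the sample parameters are illustrative and do not come from the paper.

```python
import math


def bound_arbitrary(H, S, A, K, lam):
    """Leading term H * S^lam * sqrt(K * S * A^(lam+1)) of the first bound.

    Assumption: log factors and constants hidden by O-tilde are dropped.
    """
    return H * S**lam * math.sqrt(K * S * A ** (lam + 1))


def bound_consecutive(H, S, A, K, lam):
    """Leading term H * sqrt(K * S^3 * A^(lam+1)) of the second bound.

    Assumption: log factors and constants hidden by O-tilde are dropped.
    """
    return H * math.sqrt(K * S**3 * A ** (lam + 1))


def improvement_factor(H, S, A, K, lam):
    """Ratio of the first leading term to the second.

    Algebraically this is S^lam * sqrt(S) / S^(3/2) = S^(lam - 1),
    so H, A, and K cancel out.
    """
    return bound_arbitrary(H, S, A, K, lam) / bound_consecutive(H, S, A, K, lam)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    H, S, A, K = 10, 5, 3, 10_000
    for lam in (1, 2, 3):
        print(lam, improvement_factor(H, S, A, K, lam))
```

Under these assumptions the ratio simplifies to $S^{\Lambda-1}$: the two leading terms coincide at $\Lambda=1$, and for larger $\Lambda$ the bound for consecutive adversarial steps is smaller by a factor that grows exponentially in $\Lambda$.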
