A Model Selection Approach for Corruption Robust Reinforcement Learning

Chen-Yu Wei, Christoph Dann, Julian Zimmert

TL;DR

This work tackles reinforcement learning under adversarial corruption of both rewards and transitions by introducing a model-selection (corralling) framework that adapts to an unknown total corruption $C$. The core idea is to run a suite of corruption-robust base algorithms, each with a different hypothesized corruption level, and to eliminate mis-specified levels through a principled testing procedure. This achieves a worst-case regret of $\tilde{O}(\sqrt{T}+C)$ in tabular MDPs. In finite-horizon linear MDPs, a computationally efficient algorithm attains $\tilde{O}(\sqrt{(1+C)T})$, and a computationally inefficient variant attains $\tilde{O}(\sqrt{T}+C)$. The framework further yields gap-dependent bounds via the G-COBE construction, offering $\tilde{O}\left(\min\{\sqrt{\beta_4 T}, \beta_4/\Delta\}+\beta_2 C+\beta_4\right)$ and achieving the best-of-both-worlds bound $\tilde{O}(\min\{1/\Delta, \sqrt{T}\}+C)$ without knowledge of $C$. Extensions to linear bandits, linear contextual bandits, and general function approximation (via GOLF) give corruption-robust algorithms with regret guarantees in each setting. Overall, the paper provides a unifying framework for robust RL under corruption, with improved or new guarantees across tabular, linear, and function-approximation regimes.

Abstract

We develop a model selection approach to tackle reinforcement learning with adversarial corruption in both transition and reward. For finite-horizon tabular MDPs, without prior knowledge on the total amount of corruption, our algorithm achieves a regret bound of $\widetilde{\mathcal{O}}(\min\{\frac{1}{\Delta}, \sqrt{T}\}+C)$ where $T$ is the number of episodes, $C$ is the total amount of corruption, and $\Delta$ is the reward gap between the best and the second-best policy. This is the first worst-case optimal bound achieved without knowledge of $C$, improving previous results of Lykouris et al. (2021); Chen et al. (2021); Wu et al. (2021). For finite-horizon linear MDPs, we develop a computationally efficient algorithm with a regret bound of $\widetilde{\mathcal{O}}(\sqrt{(1+C)T})$, and another computationally inefficient one with $\widetilde{\mathcal{O}}(\sqrt{T}+C)$, improving the result of Lykouris et al. (2021) and answering an open question by Zhang et al. (2021b). Finally, our model selection framework can be easily applied to other settings including linear bandits, linear contextual bandits, and MDPs with general function approximation, leading to several improved or new results.
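
To see how the two linear-MDP bounds in the abstract relate, the following short comparison may help. It ignores logarithmic factors and constants, and it assumes $0 \le C \le T$ (an assumption made here for illustration, roughly corresponding to bounded corruption per episode); it is a back-of-the-envelope check, not a statement taken from the paper.

```latex
% Additive bound versus multiplicative bound, assuming 0 <= C <= T.
\begin{align*}
\sqrt{T} + C
  &= \sqrt{T} + \sqrt{C}\,\sqrt{C} \\
  &\le \sqrt{T} + \sqrt{C}\,\sqrt{T}
     && \text{since } C \le T \\
  &\le 2\sqrt{(1+C)\,T}
     && \text{since } \sqrt{T} \le \sqrt{(1+C)T} \text{ and } \sqrt{CT} \le \sqrt{(1+C)T}.
\end{align*}
% Example with C = \sqrt{T}:
%   additive bound:        \sqrt{T} + C = 2\sqrt{T}
%   multiplicative bound:  \sqrt{(1+C)T} = \sqrt{T + T^{3/2}} \ge T^{3/4}
```

Under this assumption the additive form $\sqrt{T}+C$ is never worse than the multiplicative form $\sqrt{(1+C)T}$ by more than a factor of 2, and it can be much smaller, for instance $2\sqrt{T}$ versus at least $T^{3/4}$ when $C=\sqrt{T}$.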

Paper Structure

This paper contains 46 sections, 38 theorems, 145 equations, 1 table, 8 algorithms.

Key Result

lemma 1

With probability at least $1-\mathcal{O}(k_{\max}\delta)$, termination condition 1 of the BASIC algorithm does not hold in any round $t$ such that $C_{t}\leq 2^k$.
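
Lemma 1 refers to hypothesized corruption levels of the form $2^k$ with $k \le k_{\max}$. The sketch below illustrates why a doubling grid of this kind is a natural choice: the smallest level that upper-bounds the true corruption overshoots it by less than a factor of 2. The function names, the choice $2^{k_{\max}} \ge T$, and the oracle comparison against the true corruption are assumptions made for illustration; in the paper the corruption is unknown and levels are ruled out by a testing procedure, which this sketch does not implement.

```python
def corruption_levels(num_episodes: int) -> list[int]:
    """Return the doubling grid 1, 2, 4, ..., 2**k_max with 2**k_max >= num_episodes."""
    if num_episodes < 1:
        raise ValueError("num_episodes must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1.
    k_max = (num_episodes - 1).bit_length()
    return [2 ** k for k in range(k_max + 1)]


def smallest_valid_level(levels: list[int], corruption: float) -> int:
    """Return the smallest grid level that upper-bounds `corruption`.

    Hypothetical oracle: a real algorithm cannot compare against the
    true corruption and must rely on a statistical test instead.
    """
    for level in levels:
        if corruption <= level:
            return level
    raise ValueError("corruption exceeds the largest hypothesized level")


levels = corruption_levels(1000)
chosen = smallest_valid_level(levels, 37)
print(levels[-1], chosen)  # prints "1024 64"
```

With only $k_{\max}+1$ levels (11 for $T=1000$), the grid covers every corruption value between 1 and $T$ up to a factor of 2, which is consistent with the $k_{\max}$ factor appearing in the failure probability of Lemma 1.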

Theorems & Definitions (79)

  • lemma 1
  • lemma 2
  • theorem 1
  • theorem 2
  • lemma 3
  • lemma 4
  • lemma 5
  • lemma 6
  • lemma 7: Freedman's inequality, Theorem 1 of Beygelzimer et al. (2011)
  • lemma 8: Freedman's inequality, Lemma 4.4 of Bubeck and Slivkins (2012)
  • ...and 69 more