A Model Selection Approach for Corruption Robust Reinforcement Learning

Chen-Yu Wei; Christoph Dann; Julian Zimmert

TL;DR

This work tackles reinforcement learning under adversarial corruption of both rewards and transitions by introducing a model-selection (corralling) framework that adapts to an unknown total corruption level $C$. The core idea is to run a suite of corruption-robust base algorithms, each with a different hypothesized corruption level, and to eliminate mis-specified ones through a testing procedure. In finite-horizon tabular MDPs this achieves a worst-case regret of $\tilde{O}(\sqrt{T}+C)$. In finite-horizon linear MDPs, a computationally efficient algorithm achieves $\tilde{O}(\sqrt{(1+C)T})$, and a computationally inefficient variant achieves $\tilde{O}(\sqrt{T}+C)$. The framework further yields gap-dependent bounds via the G-COBE construction, namely $\tilde{O}\left(\min\{\sqrt{\beta_4 T}, \beta_4/\Delta\}+\beta_2 C+\beta_4\right)$ with $\beta_2$ and $\beta_4$ as defined in the paper, which gives the best-of-both-worlds bound $\tilde{O}(\min\{1/\Delta, \sqrt{T}\}+C)$ without knowledge of $C$. Extensions to linear contextual bandits and general function approximation (via GOLF) come with concrete corruption-robust algorithms and regret guarantees. Overall, the paper provides a unifying framework for corruption-robust RL across tabular, linear, and function-approximation regimes.
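The "suite of hypothesized corruption levels" can be pictured with a small sketch. This is only an illustration under stated assumptions, not the paper's algorithm: the doubling grid, the function names `corruption_grid` and `smallest_surviving_level`, the assumption that $C$ lies in $[1, T]$, and above all the oracle test `c >= true_C` are hypothetical stand-ins. The paper's actual testing procedure is statistical and is not reproduced here.

```python
def corruption_grid(T):
    """Hypothesized corruption levels 1, 2, 4, ... up to the first power of two >= T.

    Assumption for illustration: the true corruption C lies in [1, T].
    """
    levels = [1]
    while levels[-1] < T:
        levels.append(2 * levels[-1])
    return levels


def smallest_surviving_level(levels, is_consistent):
    """Discard levels in increasing order until one passes the consistency test."""
    for c in levels:
        if is_consistent(c):
            return c
    return levels[-1]  # fall back to the most conservative level


T = 10_000
true_C = 37  # unknown to a real learner; used only by the stand-in oracle below
levels = corruption_grid(T)
# Hypothetical oracle test: a level is "consistent" iff it upper-bounds the true corruption.
chosen = smallest_surviving_level(levels, lambda c: c >= true_C)
print(len(levels), chosen)
```

Two features of the sketch carry over to any doubling scheme of this kind: the grid has only about $\log_2 T$ levels, and the smallest level that is at least $C$ is less than $2C$, so running with it inflates a $C$-dependent term by at most a constant factor.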

Abstract

We develop a model selection approach to tackle reinforcement learning with adversarial corruption in both transition and reward. For finite-horizon tabular MDPs, without prior knowledge on the total amount of corruption, our algorithm achieves a regret bound of $\widetilde{\mathcal{O}}(\min\{\frac{1}{\Delta}, \sqrt{T}\}+C)$ where $T$ is the number of episodes, $C$ is the total amount of corruption, and $\Delta$ is the reward gap between the best and the second-best policy. This is the first worst-case optimal bound achieved without knowledge of $C$, improving previous results of Lykouris et al. (2021); Chen et al. (2021); Wu et al. (2021). For finite-horizon linear MDPs, we develop a computationally efficient algorithm with a regret bound of $\widetilde{\mathcal{O}}(\sqrt{(1+C)T})$, and another computationally inefficient one with $\widetilde{\mathcal{O}}(\sqrt{T}+C)$, improving the result of Lykouris et al. (2021) and answering an open question by Zhang et al. (2021b). Finally, our model selection framework can be easily applied to other settings including linear bandits, linear contextual bandits, and MDPs with general function approximation, leading to several improved or new results.
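To see how the two linear-MDP rates relate, here is a short comparison of their shapes. It ignores the logarithmic and problem-dependent factors hidden in $\widetilde{\mathcal{O}}(\cdot)$ and assumes $0 \le C \le T$. Both are assumptions made for this illustration rather than statements from the paper.

```latex
\begin{align*}
C &= \sqrt{C}\cdot\sqrt{C} \;\le\; \sqrt{CT}
  && \text{(since } C \le T\text{)} \\
\sqrt{T} + C &\le \sqrt{T} + \sqrt{CT}
  \;\le\; \sqrt{2\,(T + CT)}
  \;=\; \sqrt{2}\,\sqrt{(1+C)\,T}
  && \text{(since } a + b \le \sqrt{2(a^2+b^2)}\text{)} \\
\text{At } C = \sqrt{T}:\quad
\sqrt{T} + C &= 2\sqrt{T},
  \qquad
  \sqrt{(1+C)\,T} \;\ge\; \sqrt{\sqrt{T}\cdot T} \;=\; T^{3/4}.
\end{align*}
```

Under these assumptions, the $\sqrt{T}+C$ shape is never worse than $\sqrt{(1+C)T}$ by more than a constant factor, and it is polynomially better at intermediate corruption levels such as $C=\sqrt{T}$.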
