Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes

Chenlu Ye; Wei Xiong; Quanquan Gu; Tong Zhang

TL;DR

This work extends uncertainty-weighted regression for corruption-robust reinforcement learning from linear to nonlinear function approximation. It introduces CR-Eluder-UCB for nonlinear contextual bandits and CR-LSVI-UCB for episodic MDPs; both use data-dependent weights to bound the weighted uncertainty and maintain optimism under adversarial corruption. The authors establish regret bounds that depend additively on the corruption level $\zeta$ and are parameterized by the eluder dimension and covering number of the function class, and they handle unknown $\zeta$ via adaptive tuning. A key contribution is a new uncertainty estimator, together with a weight-control analysis that bounds the cumulative weighted uncertainty for general function classes. This yields regret bounds that nearly match the lower bound or improve on existing methods, for known and unknown corruption and in both the bandit and MDP settings. The result is a unified, computationally efficient framework for robust RL with general function approximation.
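To make the idea of data-dependent weights concrete, here is a minimal sketch of uncertainty-weighted ridge regression in the linear setting that the paper builds on. It is an illustration under stated assumptions, not the paper's nonlinear estimator: the weight rule `w_t = min(1, alpha / u_t)` with `u_t` the elliptical norm of the feature vector is the form commonly used in the linear literature, and the function name, the ridge parameter `lam`, and the return values are hypothetical choices made for this example.

```python
import numpy as np


def uncertainty_weighted_ridge(X, y, alpha, lam=1.0):
    """Sequential weighted ridge regression (linear-case illustration).

    Assumption: each sample gets weight w_t = min(1, alpha / u_t), where
    u_t = sqrt(x_t^T Sigma^{-1} x_t) is its uncertainty under the weighted
    covariance of earlier samples. High-uncertainty samples are down-weighted
    so that w_t * u_t never exceeds alpha.
    """
    n, d = X.shape
    Sigma = lam * np.eye(d)   # regularized weighted covariance
    b = np.zeros(d)           # weighted sum of y_t * x_t
    weights = np.empty(n)
    for t in range(n):
        x = X[t]
        # Uncertainty of x under the data seen so far.
        u = np.sqrt(x @ np.linalg.solve(Sigma, x))
        w = min(1.0, alpha / u) if u > 0 else 1.0
        weights[t] = w
        Sigma += w * np.outer(x, x)
        b += w * y[t] * x
    theta = np.linalg.solve(Sigma, b)
    return theta, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
    y[:5] += 10.0  # a few adversarially corrupted responses
    theta, weights = uncertainty_weighted_ridge(X, y, alpha=0.2)
    print(theta, weights[:5])
```

The design point is that a corrupted sample can only hurt the estimate in proportion to its weight times its uncertainty, and the weight rule caps that product at `alpha`. With a very large `alpha` every weight is 1 and the estimator reduces to ordinary ridge regression.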

Abstract

Despite the significant interest and progress in reinforcement learning (RL) problems with adversarial corruption, current works are either confined to the linear setting or lead to an undesired $\tilde{O}(\sqrt{T}\zeta)$ regret bound, where $T$ is the number of rounds and $\zeta$ is the total amount of corruption. In this paper, we consider the contextual bandit with general function approximation and propose a computationally efficient algorithm that achieves a regret of $\tilde{O}(\sqrt{T}+\zeta)$. The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandits and a new weighted estimator of uncertainty for the general function class. In contrast to the existing analysis, which relies heavily on the linear structure, we develop a novel technique to control the sum of weighted uncertainty, thus establishing the final regret bounds. We then generalize our algorithm to the episodic MDP setting and are the first to achieve an additive dependence on the corruption level $\zeta$ in the scenario of general function approximation. Notably, our algorithms achieve regret bounds that either nearly match the performance lower bound or improve on the existing methods for all corruption levels and in both the known and unknown $\zeta$ cases.
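The gap between the multiplicative rate $\tilde{O}(\sqrt{T}\zeta)$ and the additive rate $\tilde{O}(\sqrt{T}+\zeta)$ is easy to see numerically. The sketch below drops all constants and logarithmic factors, so the numbers only indicate orders of magnitude; the function names and the chosen values of $T$ and $\zeta$ are illustrative assumptions.

```python
import math


def multiplicative_bound(T, zeta):
    """Order of the sqrt(T) * zeta rate (constants and log factors dropped)."""
    return math.sqrt(T) * zeta


def additive_bound(T, zeta):
    """Order of the sqrt(T) + zeta rate (constants and log factors dropped)."""
    return math.sqrt(T) + zeta


if __name__ == "__main__":
    T = 10_000
    for zeta in (1, 10, 100, 1000):
        print(zeta, multiplicative_bound(T, zeta), additive_bound(T, zeta))
```

Once $\zeta$ reaches $\sqrt{T}$, the multiplicative rate is already of order $T$, which is no better than the trivial linear bound, whereas the additive rate stays sublinear as long as $\zeta$ grows more slowly than $T$.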

Paper Structure (31 sections, 18 theorems, 144 equations, 2 algorithms)

This paper contains 31 sections, 18 theorems, 144 equations, 2 algorithms.

Introduction
Related Work
Preliminaries
Nonlinear Contextual Bandits with Corruption
Nonlinear MDPs with Corruption
Eluder Dimension and Covering Number
Algorithms
Bandits with General Function Approximation
MDPs with General Function Approximation
Main Results
Bandits with General Function Approximation
MDPs with General Function Approximation
Unknown Corruption Level
Proof Sketch
Conclusions
...and 16 more sections

Key Result

Theorem 4.1

Suppose that Assumption as:bandit holds. For any cumulative corruption $\zeta>0$ and $\delta\in(0,1)$, we take the covering parameter $\gamma=1/(T\zeta)$, the eluder parameter $\lambda=\ln(N(\gamma,{\mathcal{F}}))$, the weighting parameter $\alpha=\sqrt{\ln(N(\gamma,{\mathcal{F}}))}/\zeta$ and the c where $c_0=\sqrt{\eta^2\ln(2/\delta)}$ and $c_{\beta}>0$ is an absolute constant. Then, with probab
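The parameter choices that are legible in the theorem statement above can be computed mechanically once a log-covering number is available. The sketch below covers only $\gamma$, $\lambda$, and $\alpha$; the confidence-radius expression is cut off in the statement and is deliberately not reconstructed. The helper name, the `BanditParams` container, and the example covering-number formula $d\ln(1+2/\gamma)$ are hypothetical and used purely for illustration.

```python
import math
from typing import Callable, NamedTuple


class BanditParams(NamedTuple):
    gamma: float  # covering parameter, 1 / (T * zeta)
    lam: float    # eluder parameter, ln N(gamma, F)
    alpha: float  # weighting parameter, sqrt(ln N(gamma, F)) / zeta


def theorem_4_1_params(
    T: int, zeta: float, log_covering: Callable[[float], float]
) -> BanditParams:
    """Compute gamma, lambda, alpha as stated in Theorem 4.1.

    `log_covering(gamma)` must return ln N(gamma, F) for the function class;
    how to obtain it is problem-specific and is an input here.
    """
    if T <= 0 or zeta <= 0:
        raise ValueError("T and zeta must be positive")
    gamma = 1.0 / (T * zeta)
    log_n = log_covering(gamma)
    return BanditParams(gamma=gamma, lam=log_n, alpha=math.sqrt(log_n) / zeta)


if __name__ == "__main__":
    d = 5  # hypothetical dimension for an illustrative covering-number formula
    params = theorem_4_1_params(
        T=10_000, zeta=50.0, log_covering=lambda g: d * math.log(1 + 2 / g)
    )
    print(params)
```

Note that $\alpha$ shrinks as $\zeta$ grows: the more corruption is anticipated, the more aggressively high-uncertainty samples are down-weighted.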

Theorems & Definitions (37)

Definition 2.1: Cumulative Corruption for Bandits
Remark 2.2
Definition 2.4: Cumulative corruption
Definition 2.6: $\epsilon$-dependence
Definition 2.7: Eluder Dimension
Definition 2.8: $\epsilon$-cover and covering number
Theorem 4.1
Theorem 4.2
Theorem 4.3
Lemma 5.1
...and 27 more
