Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

Jiayi Huang; Han Zhong; Liwei Wang; Lin F. Yang

TL;DR

The paper addresses RL with heavy-tailed rewards under linear function approximation, where rewards have only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. It introduces Heavy-OFUL for linear bandits and Heavy-LSVI-UCB for linear MDPs. Both rely on a novel self-normalized concentration inequality for adaptive Huber regression, which yields instance-dependent regret bounds that scale with the central moments of the rewards. The main contributions are a bandit bound that is minimax optimal in the worst case, the first computationally efficient instance-dependent regret bound for linear MDPs in this setting, and a matching minimax lower bound. In the finite-variance case ($\epsilon=1$), the bounds become variance-aware and are comparable to those known for bounded rewards.

Abstract

While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action spaces exist when the rewards are \emph{heavy-tailed}, i.e., with only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the \emph{first} of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show that the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed \textsc{Heavy-LSVI-UCB}, achieves the \emph{first} computationally efficient \emph{instance-dependent} $K$-episode regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^\frac{1}{1+\epsilon} + d \sqrt{H \mathcal{V}^* K})$. Here, $H$ is the length of the episode, and $\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with the central moments of the rewards and value functions, respectively. We also provide a matching minimax lower bound $\Omega(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K})$ to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
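To make the exponents in the bandit bound concrete, the following worked algebra specializes it to two cases. It uses only the bound stated in the abstract; the uniform bound $\nu$ on the $\nu_t$ is an assumption introduced here for illustration and is not notation taken from the paper.

```latex
% Case 1: finite variance, \epsilon = 1. The exponent of T vanishes:
\frac{1-\epsilon}{2(1+\epsilon)}\bigg|_{\epsilon=1} = 0
\quad\Longrightarrow\quad
\tilde{O}\Big(d\sqrt{\textstyle\sum_{t=1}^{T}\nu_t^2} + d\Big).

% Case 2: worst case, assuming \nu_t \le \nu for all t (illustrative assumption).
% Then \sqrt{\sum_{t=1}^{T}\nu_t^2} \le \nu\sqrt{T}, and the leading term becomes
d\,T^{\frac{1-\epsilon}{2(1+\epsilon)}}\cdot\nu\sqrt{T}
  = d\,\nu\,T^{\frac{1-\epsilon}{2(1+\epsilon)}+\frac{1+\epsilon}{2(1+\epsilon)}}
  = d\,\nu\,T^{\frac{1}{1+\epsilon}}.
```

The exponent $\frac{1}{1+\epsilon}$ obtained in the second case is the same one that appears on $K$ in the first term of the MDP upper and lower bounds above.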


Paper Structure (83 sections, 38 theorems, 218 equations, 1 figure, 2 tables, 3 algorithms)


Introduction
Road Map
Notations
Preliminaries
Heavy-Tailed Linear Bandits
Linear MDPs with Heavy-Tailed Rewards
Adaptive Huber Regression
Linear Bandits
Algorithm Description
Regret Analysis
Linear MDPs
Algorithm Description
Estimation for expected heavy-tailed rewards
Estimation for central moment of rewards
Estimation for expected next-state value functions
...and 68 more sections

Key Result

Theorem 3.3

For the online regression problems in Definition def:regression, we solve for $\bm{\theta}_t$ by adaptive Huber regression in Algorithm algo:huber with $c_0,c_1,\tau_0$ in Appendix proof:heavy. Then for any $\delta\in(0,1)$, with probability at least $1- 3\delta$, for all $t \le T$, we have ...
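The theorem concerns an estimator obtained by Huber regression. As a rough illustration of the underlying idea only, the sketch below fits a ridge-regularized regression under the standard Huber loss with a single fixed threshold `tau`, using plain gradient descent. It is not the paper's adaptive procedure: the per-round thresholds and the constants $c_0,c_1,\tau_0$ are not reproduced, and all function and parameter names (`huber_loss`, `huber_grad`, `huber_regression`, `lam`, `steps`) are hypothetical.

```python
def huber_loss(x, tau):
    """Standard Huber loss: quadratic for |x| <= tau, linear beyond."""
    ax = abs(x)
    return 0.5 * x * x if ax <= tau else tau * ax - 0.5 * tau * tau


def huber_grad(x, tau):
    """Derivative of the Huber loss: the residual clipped to [-tau, tau]."""
    return max(-tau, min(tau, x))


def huber_regression(features, targets, tau, lam=1.0, steps=500):
    """Minimize lam/2 * ||theta||^2 + sum_s huber_loss(y_s - <phi_s, theta>, tau).

    Illustrative fixed-threshold version (an assumption of this sketch),
    solved by gradient descent with step size 1/L, where
    L = lam + sum_s ||phi_s||^2 upper-bounds the gradient's Lipschitz constant.
    """
    d = len(features[0])
    theta = [0.0] * d
    lipschitz = lam + sum(sum(v * v for v in phi) for phi in features)
    lr = 1.0 / lipschitz
    for _ in range(steps):
        grad = [lam * th for th in theta]
        for phi, y in zip(features, targets):
            residual = y - sum(p * th for p, th in zip(phi, theta))
            g = huber_grad(residual, tau)  # clipping limits each sample's pull
            for i in range(d):
                grad[i] -= g * phi[i]
        theta = [th - lr * g for th, g in zip(theta, grad)]
    return theta


# Toy data: 19 clean observations of the value 1.0 and one extreme outlier.
features = [[1.0]] * 20
targets = [1.0] * 19 + [1000.0]
theta_huber = huber_regression(features, targets, tau=1.0, lam=0.01)
theta_ridge = sum(targets) / (len(targets) + 0.01)  # closed-form ridge estimate
```

On this toy data the clipped gradient caps the outlier's contribution at `tau`, so the Huber estimate stays close to 1, whereas the ridge (squared-loss) estimate is pulled to roughly 51.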

Figures (1)

Figure 1: Comparisons of our algorithm ($\textsc{Heavy-OFUL}$) versus MENU and TOFU in heavy-tailed linear bandits problems for $1\times10^4$ rounds.

Theorems & Definitions (48)

Definition 2.1: Heterogeneous linear bandits with heavy-tailed rewards
Definition 2.2
Remark 2.6
Definition 3.1: Huber loss
Definition 3.2
Theorem 3.3
Remark 3.4
Theorem 4.1
Remark 4.2
Theorem 5.1
...and 38 more
