Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms

Mengfan Xu, Diego Klabjan

TL;DR

This work addresses exploration in bandits and RL under scale-free, potentially unbounded rewards by introducing EXP4.P for contextual bandits and EXP4-RL for RL. EXP4.P attains sublinear regret bounds for both bounded and unbounded rewards in linear and stochastic contextual bandits, and including one sufficiently competent expert lets it reach global optimality in the linear case; the unbounded-reward result also carries over to a revised EXP3.P for MAB. EXP4-RL combines multiple RL experts (e.g., RND and DQN-based policies) through exponential weighting to encourage global exploration, and empirical results on Mountain Car and Montezuma's Revenge show notable improvements over standard intrinsic reward baselines. Overall, the paper pairs theoretical guarantees for unbounded reward settings with practical RL exploration methods for hard-to-explore environments.

Abstract

We study the challenging exploration incentive problem in both bandits and reinforcement learning, where the rewards are scale-free and potentially unbounded, driven by real-world scenarios and differing from existing work. Past works in reinforcement learning either assume costly interactions with an environment or propose algorithms finding potentially low-quality local maxima. Motivated by EXP-type methods that integrate multiple agents (experts) for exploration in bandits with the assumption that rewards are bounded, we propose new algorithms, namely EXP4.P and EXP4-RL, for exploration in the unbounded reward case, and demonstrate their effectiveness in these new settings. Unbounded rewards introduce challenges as the regret cannot be limited by the number of trials, and selecting suboptimal arms may lead to infinite regret. Specifically, we establish EXP4.P's regret upper bounds in both bounded and unbounded linear and stochastic contextual bandits. Surprisingly, we also find that by including one sufficiently competent expert, EXP4.P can achieve global optimality in the linear case. This unbounded reward result is also applicable to a revised version of EXP3.P in the Multi-armed Bandit scenario. In EXP4-RL, we extend EXP4.P from bandit scenarios to reinforcement learning to incentivize exploration by multiple agents, including one high-performing agent, for both efficiency and excellence. This algorithm has been tested on difficult-to-explore games and shows significant improvements in exploration compared to state-of-the-art methods.
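To make the idea of integrating multiple experts by exponential weighting concrete, the sketch below implements one round of the classic EXP4 update for bounded rewards. This is not the paper's EXP4.P or EXP4-RL, whose details are not given in this summary; the function name `exp4_step`, the callback `reward_fn`, and the toy two-expert setup are assumptions made purely for illustration.

```python
import numpy as np

def exp4_step(weights, advice, gamma, reward_fn, rng):
    """One round of classic EXP4 (illustrative, bounded rewards in [0, 1]).

    weights   : shape (N,), positive expert weights
    advice    : shape (N, K), each row is one expert's distribution over K arms
    gamma     : exploration rate in (0, 1]
    reward_fn : hypothetical callback mapping an arm index to its reward
    """
    n_experts, n_arms = advice.shape
    # Mix the experts' advice by normalized weight, then add uniform exploration.
    probs = (1 - gamma) * (weights @ advice) / weights.sum() + gamma / n_arms
    arm = int(rng.choice(n_arms, p=probs))
    reward = reward_fn(arm)
    # Importance-weighted reward estimate: nonzero only for the pulled arm.
    estimate = np.zeros(n_arms)
    estimate[arm] = reward / probs[arm]
    # Each expert is credited with the estimated reward of its own advice.
    expert_gain = advice @ estimate
    new_weights = weights * np.exp(gamma * expert_gain / n_arms)
    return new_weights, arm, probs

# Toy setup (assumed): expert 0 always advises arm 0, expert 1 always arm 1,
# and only arm 1 pays a reward.
rng = np.random.default_rng(0)
advice = np.array([[1.0, 0.0], [0.0, 1.0]])
weights = np.ones(2)
for _ in range(200):
    weights, arm, probs = exp4_step(
        weights, advice, 0.1, lambda a: 1.0 if a == 1 else 0.0, rng
    )
```

In this toy run the weight of the expert that recommends the rewarding arm grows while the other stays fixed, so the mixed policy concentrates on the better expert while the uniform term keeps every arm explored.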

Paper Structure

This paper contains 35 sections, 23 theorems, 120 equations, 3 figures, 1 table, 3 algorithms.

Key Result

Theorem 1

Let $0 \leq r^t \leq 1$ for every $t$. For any fixed time horizon $T > 0$, for all $K, \, N \geq 2$ and for any $1 > \delta > 0$, $\gamma = \sqrt{\frac{3K\ln{N}}{T(\frac{2N}{3} + 1) }} \leq \frac{1}{2}$, $\alpha = 2\sqrt{K\ln{\frac{NT}{\delta}}}$, we have that with probability at least $1-\delta$, $
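As a quick numerical check of the parameter choices stated in Theorem 1, the following sketch evaluates $\gamma$ and $\alpha$ for given $K$, $N$, $T$ and $\delta$, and reports whether the requirement $\gamma \leq \frac{1}{2}$ holds. The function name `theorem1_parameters` and the example values are assumptions for illustration; only the two formulas come from the theorem statement.

```python
import math

def theorem1_parameters(K, N, T, delta):
    """Evaluate gamma and alpha as written in Theorem 1.

    K, N  : both must be at least 2, as the theorem requires
    T     : fixed time horizon, T > 0
    delta : confidence parameter, 0 < delta < 1
    Returns (gamma, alpha, gamma_ok), where gamma_ok says whether gamma <= 1/2.
    """
    if K < 2 or N < 2:
        raise ValueError("Theorem 1 assumes K >= 2 and N >= 2")
    if T <= 0:
        raise ValueError("Theorem 1 assumes T > 0")
    if not 0 < delta < 1:
        raise ValueError("Theorem 1 assumes 0 < delta < 1")
    # gamma = sqrt(3 K ln N / (T (2N/3 + 1)))
    gamma = math.sqrt(3 * K * math.log(N) / (T * (2 * N / 3 + 1)))
    # alpha = 2 sqrt(K ln(N T / delta))
    alpha = 2 * math.sqrt(K * math.log(N * T / delta))
    return gamma, alpha, gamma <= 0.5

# Example values (assumed): K = 2, N = 2, horizon 1000, delta = 0.05.
gamma, alpha, gamma_ok = theorem1_parameters(K=2, N=2, T=1000, delta=0.05)
```

Because $\gamma$ shrinks like $1/\sqrt{T}$, the condition $\gamma \leq \frac{1}{2}$ fails only for short horizons; with $K = N = 2$ and $T = 1$, for instance, the formula gives a value above $\frac{1}{2}$.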

Figures (3)

  • Figure 1: The performance of the EXP4-RL algorithm and RND measured by the epoch-wise reward on Mountain Car.
  • Figure 2: The performance of the EXP4-RL algorithm and RND measured by intrinsic reward without parallel environments with three different burn-in periods.
  • Figure 4: The framework of regret analysis in non-stochastic bandits.

Theorems & Definitions (41)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Theorem 8
  • Theorem 9
  • Theorem 10
  • ...and 31 more