Hyper: Hyperparameter Robust Efficient Exploration in Reinforcement Learning

Yiran Wang, Chenshu Liu, Yunfan Li, Sanae Amani, Bolei Zhou, Lin F. Yang

TL;DR

Hyper addresses the hyperparameter sensitivity of curiosity-driven exploration in reinforcement learning by introducing a repositioning-based framework that decouples exploration from exploitation and regularizes exploration visitation. Its Linear-UCB-Hyper variant is provably efficient under linear MDP assumptions, and the method shows strong empirical robustness across diverse tasks, maintaining good performance across a wide range of values of the curiosity coefficient $\beta$. By isolating task learning from exploration and truncating the repositioning length with a bounded geometric distribution, Hyper mitigates the instability caused by large intrinsic rewards and by distribution shift. Empirically, Hyper matches or surpasses the TD3, Curiosity, and Decouple baselines on both exploration-heavy and sparse-reward tasks, with notably lower sensitivity to hyperparameters and more stable training.
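
To make the "bounded geometric distribution" mentioned above concrete, here is a minimal sketch of one way to sample a repositioning length from a geometric distribution truncated at a maximum length and renormalized. The function names, the parameter `p`, the bound `max_len`, and the choice of renormalizing rather than clipping are all assumptions for illustration; the summary does not state the exact form the paper uses.

```python
import numpy as np


def bounded_geometric_pmf(p: float, max_len: int) -> np.ndarray:
    """PMF over lengths 0..max_len of a geometric distribution
    truncated at max_len and renormalized (hypothetical helper)."""
    k = np.arange(max_len + 1)
    weights = p * (1.0 - p) ** k  # unnormalized geometric weights
    return weights / weights.sum()  # renormalize over the bounded support


def sample_reposition_length(rng: np.random.Generator, p: float, max_len: int) -> int:
    """Draw one repositioning length in [0, max_len] (hypothetical helper)."""
    pmf = bounded_geometric_pmf(p, max_len)
    return int(rng.choice(max_len + 1, p=pmf))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lengths = [sample_reposition_length(rng, p=0.1, max_len=50) for _ in range(5)]
    print(lengths)
```

Bounding the support means no sampled length can exceed `max_len`, which is the property the truncation is meant to guarantee; how the sampled length is then used inside an episode is not specified in this summary.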

Abstract

The exploration-exploitation dilemma poses significant challenges in reinforcement learning (RL). Recently, curiosity-based exploration methods have achieved great success in tackling hard-exploration problems. However, they require extensive hyperparameter tuning for each environment, which heavily limits the applicability and accessibility of this line of methods. In this paper, we characterize this problem through an analysis of agent behavior and show why choosing a proper hyperparameter is fundamentally difficult. We then identify the difficulty and instability of the optimization when the agent learns with curiosity. We propose our method, hyperparameter robust exploration (Hyper), which substantially mitigates the problem by regularizing the visitation of the exploration and decoupling the exploitation to ensure stable training. We theoretically justify that Hyper is provably efficient under the function approximation setting and empirically demonstrate its appealing performance and robustness in various environments.

Paper Structure

This paper contains 28 sections, 21 theorems, 81 equations, 9 figures, 2 tables, 3 algorithms.

Key Result

Theorem 4.2

(Informal) Given the linear-realizability condition, Linear-Hyper learns a near-optimal exploitation policy of any task with high probability. The number of samples required scales polynomially with the intrinsic dimension and the horizon associated with the task.

Figures (9)

  • Figure 1: Performance of pure exploitation, curiosity-driven exploration, and our algorithm with different choices of $\beta$. Each data point is the performance after 1M training steps, averaged over 5 runs. Curiosity-driven exploration (UCB-Q) is very sensitive to the hyperparameter $\beta$. We propose Hyper, which is empirically robust to $\beta$ and theoretically efficient.
  • Figure 2: Comparison of the state visitation of the UCB-Q agent under different exploration coefficients, in an environment with a suboptimal goal and an optimal goal. Higher visitation is shown in brighter colors. (a) Layout of the environment. (b) State visitation of UCB-Q with $\beta=0.01$: the agent gets stuck in a suboptimal policy due to an insufficient exploration bonus. (c) State visitation of UCB-Q with $\beta=0.1$: the agent finds a near-optimal policy. (d) State visitation of UCB-Q with $\beta=1.0$: the agent over-explores and cannot learn to exploit because the curiosity bonus is too high.
  • Figure 3: Decoupling causes distribution shift, where the exploitation policy drastically overestimates its value (yellow), yielding poor performance (blue).
  • Figure 4: Distribution of the length of the repositioning phase. Green: bounded geometric distribution. Blue: original geometric distribution.
  • Figure 5: Performance of Hyper and baselines. For locomotion tasks, performance is measured by the episodic cumulative reward; for navigation tasks, it is measured by the success rate. Each line is averaged over 5 runs with different random seeds.
  • ...and 4 more figures
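
The role of $\beta$ in the captions for Figures 1 and 2 can be illustrated with a generic count-based bonus, where the learning reward is the task reward plus $\beta$ times a bonus that shrinks with the visit count. This is a minimal sketch under assumptions: the class name `CountBonus`, the $1/\sqrt{N(s,a)}$ bonus form, and the tabular counts are illustrative choices, not the exact bonus used by UCB-Q or Hyper in the paper.

```python
import math
from collections import defaultdict


class CountBonus:
    """Adds a beta-scaled count-based exploration bonus to the task reward
    (hypothetical helper; the bonus form is an assumption)."""

    def __init__(self, beta: float):
        self.beta = beta
        self.counts = defaultdict(int)  # visit counts N(s, a)

    def augmented_reward(self, state, action, reward: float) -> float:
        # Update the count first so the bonus is always well defined.
        self.counts[(state, action)] += 1
        bonus = 1.0 / math.sqrt(self.counts[(state, action)])
        return reward + self.beta * bonus


if __name__ == "__main__":
    for beta in (0.01, 0.1, 1.0):
        shaper = CountBonus(beta)
        print(beta, shaper.augmented_reward("s0", "a0", reward=0.0))
```

With a small `beta` the bonus barely changes the reward, and with a large `beta` it can dominate the task reward, which mirrors the under-exploration and over-exploration regimes described in the Figure 2 caption.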

Theorems & Definitions (37)

  • Theorem 4.2
  • Theorem B.1
  • Proposition B.3
  • Proposition B.4
  • proof
  • Lemma B.5
  • Lemma B.6
  • proof
  • Remark B.7
  • Lemma B.8
  • ...and 27 more