Is Exploration All You Need? Effective Exploration Characteristics for Transfer in Reinforcement Learning

Jonathan C. Balloch, Rishav Bhagat, Geigh Zollicoffer, Ruoran Jia, Julia Kim, Mark O. Riedl

TL;DR

This work asks which exploration characteristics enable efficient online transfer in deep reinforcement learning under non-stationary novelties. It conducts a large-scale empirical study, evaluating eleven exploration algorithms on a diverse set of two-environment transfer problems, and introduces a taxonomy organized by exploration principle (stochasticity, explicit diversity, separate objective), temporal locality, and algorithmic instantiation. Using metrics such as convergence efficiency, adaptive efficiency, final adaptive performance, and Tr-AUC, it finds that explicit diversity and stochasticity are the most consistently beneficial for transfer across novelties and environments, while the benefits of time-dependent exploration vary by task and novelty type. The results provide practical guidance for selecting and combining exploration characteristics to improve online task transfer in real-world, non-stationary RL settings and suggest directions for dynamic, transfer-aware exploration design.

Abstract

In deep reinforcement learning (RL) research, there has been a concerted effort to design more efficient and productive exploration methods while solving sparse-reward problems. These exploration methods often share common principles (e.g., improving diversity) and implementation details (e.g., intrinsic reward). Prior work found that non-stationary Markov decision processes (MDPs) require exploration to efficiently adapt to changes in the environment with online transfer learning. However, the relationship between specific exploration characteristics and effective transfer learning in deep RL has not been characterized. In this work, we seek to understand the relationships between salient exploration characteristics and improved performance and efficiency in transfer learning. We test eleven popular exploration algorithms on a variety of transfer types, or "novelties," to identify the characteristics that positively affect online transfer learning. Our analysis shows that some characteristics correlate with improved performance and efficiency across a wide range of transfer tasks, while others only improve transfer performance with respect to specific environment changes. From our analysis, we make recommendations about which exploration algorithm characteristics are best suited to specific transfer situations.

Paper Structure (23 sections, 1 equation, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Environments and novelties used to evaluate the exploration algorithms and their characteristics. This shows the mixture novelties and discrete and continuous control environments.
  • Figure 2: Full learning and adaptation process of all eleven RL exploration algorithms on the DoorKeyChange novelty problem from NovGrid (Balloch et al., 2022). Eleven RL agents first learn a task assuming a stationary MDP; the rate of learning at this stage is the convergence efficiency. At time step 5,000,000, novelty is injected into the environment, transferring from $MDP_\mathrm{source}$ to $MDP_\mathrm{target}$ and often causing a drop in performance. The algorithms then recover their past performance as they learn the new transition dynamics; the rate of learning at this stage is the adaptive efficiency. The maximum episode reward is the final adaptive performance, which may not always be as high as pre-novelty performance.
  • Figure 3: The Adaptive Efficiency and Tr-AUC inter-quartile mean plots for DoorKeyChange. NoisyNets performs well on both metrics. Note that the Adaptive Efficiency plots include only runs that converged on both tasks, while the Tr-AUC plots include only runs that converged on the first task.
  • Figure 4: Results from the LavaSafe shortcut novelty. Some of the exploration algorithms are able to find the shortcut, rising above the pre-novelty performance, while others never discover the shortcut. The dotted blue line indicates where novelty was injected.
  • Figure 5: Results from the Walker thigh-length change task.
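
The quantities named in the Figure 2 and Figure 3 captions can be illustrated with a minimal Python sketch. This is not the paper's implementation: the exact metric definitions are not given in this section, so the sketch assumes that efficiency is measured as the number of evaluation points needed to reach a reward threshold, that final adaptive performance is the maximum post-novelty reward, and that a plain post-novelty area under the curve stands in for Tr-AUC. The function names, the `threshold` parameter, and the example curve are all hypothetical.

```python
from typing import Dict, Optional, Sequence


def steps_to_threshold(rewards: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first evaluation point whose reward reaches the threshold.

    Returns None when the curve never reaches it (a run that did not converge).
    """
    for i, reward in enumerate(rewards):
        if reward >= threshold:
            return i
    return None


def transfer_metrics(
    rewards: Sequence[float], novelty_index: int, threshold: float
) -> Dict[str, Optional[float]]:
    """Summarize one learning curve that has a novelty injected at novelty_index.

    Assumed (hypothetical) definitions, not taken from the paper:
      convergence_steps: evaluation points needed to reach threshold pre-novelty
      adaptation_steps: evaluation points needed to reach threshold post-novelty
      final_adaptive_performance: maximum post-novelty reward
      post_novelty_auc: trapezoidal area under the post-novelty curve
    """
    if not 0 < novelty_index < len(rewards):
        raise ValueError("novelty_index must split the curve into two non-empty parts")
    pre = rewards[:novelty_index]
    post = rewards[novelty_index:]
    # Trapezoidal rule with unit spacing between evaluation points.
    auc = sum((a + b) / 2.0 for a, b in zip(post, post[1:]))
    return {
        "convergence_steps": steps_to_threshold(pre, threshold),
        "adaptation_steps": steps_to_threshold(post, threshold),
        "final_adaptive_performance": max(post),
        "post_novelty_auc": auc,
    }


def interquartile_mean(values: Sequence[float]) -> float:
    """Mean of the values left after dropping floor(n/4) from each end."""
    ordered = sorted(values)
    k = len(ordered) // 4
    middle = ordered[k: len(ordered) - k]
    return sum(middle) / len(middle)


# Hypothetical curve: learn, novelty injected at index 5, then recover.
curve = [0.0, 0.2, 0.6, 0.9, 0.9, 0.1, 0.3, 0.7, 0.95, 0.95]
metrics = transfer_metrics(curve, novelty_index=5, threshold=0.8)
```

Aggregating a per-run metric with `interquartile_mean` discards the highest and lowest quarter of runs, which makes the summary less sensitive to a few runs that diverge or get lucky. Runs whose `convergence_steps` or `adaptation_steps` is `None` would need to be filtered first, in the spirit of the filtering described in the Figure 3 caption.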