RICE: Breaking Through the Training Bottlenecks of Reinforcement Learning with Explanation

Zelei Cheng, Xian Wu, Jiahao Yu, Sabrina Yang, Gang Wang, Xinyu Xing

TL;DR

RICE addresses training bottlenecks in reinforcement learning with sparse rewards by using explanation-derived frontiers to refine a pre-trained policy. It constructs a mixed initial state distribution combining default states with critical states identified by an improved StateMask-based mask network and applies exploration via Random Network Distillation within PPO, yielding a tighter sub-optimality bound compared to prior refinement methods. The approach demonstrates strong improvements across MuJoCo benchmarks and real-world tasks, reduces training time for the explanation component, and mitigates overfitting risks through mixed initialization and exploration. This has practical impact for refining pre-trained policies in simulable environments, enabling more efficient and robust policy improvement without full retraining from scratch.
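The exploration step mentioned above relies on Random Network Distillation (RND): a predictor is trained to match a fixed, randomly initialized target network, and the prediction error serves as a novelty bonus. The following is a minimal sketch of that idea only, assuming a toy linear predictor and a hand-written gradient step; the class name `TinyRND`, its parameters, and the feature sizes are illustrative and are not taken from the paper, which uses the bonus inside PPO.

```python
import math
import random
from typing import List

Vector = List[float]


class TinyRND:
    """Toy RND: fixed random target features, trainable linear predictor."""

    def __init__(self, state_dim: int, feat_dim: int = 4,
                 lr: float = 0.05, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Target network: random weights that are never updated.
        self.target = [[rng.gauss(0.0, 1.0) for _ in range(state_dim)]
                       for _ in range(feat_dim)]
        # Predictor network: starts at zero and is trained on visited states.
        self.predictor = [[0.0] * state_dim for _ in range(feat_dim)]
        self.lr = lr

    def _target_features(self, state: Vector) -> Vector:
        return [math.tanh(sum(w * x for w, x in zip(row, state)))
                for row in self.target]

    def _predicted_features(self, state: Vector) -> Vector:
        return [sum(w * x for w, x in zip(row, state))
                for row in self.predictor]

    def bonus(self, state: Vector) -> float:
        """Novelty bonus: squared error between predictor and target."""
        target = self._target_features(state)
        pred = self._predicted_features(state)
        return sum((p - t) ** 2 for p, t in zip(pred, target))

    def update(self, state: Vector) -> None:
        """One gradient-descent step on the squared error for this state."""
        target = self._target_features(state)
        pred = self._predicted_features(state)
        for i, row in enumerate(self.predictor):
            err = pred[i] - target[i]
            for j, x in enumerate(state):
                row[j] -= self.lr * 2.0 * err * x
```

States the agent visits often end up with a small bonus, while unvisited states keep a large one. In a PPO-style loop, this bonus would typically be scaled and added to the environment reward; the scaling factor is a tuning choice not specified here.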

Abstract

Deep reinforcement learning (DRL) is playing an increasingly important role in real-world applications. However, obtaining an optimally performing DRL agent for complex tasks, especially those with sparse rewards, remains a significant challenge. The training of a DRL agent can often become trapped in a bottleneck and make no further progress. In this paper, we propose RICE, an innovative refining scheme for reinforcement learning that incorporates explanation methods to break through these training bottlenecks. The high-level idea of RICE is to construct a new initial state distribution that combines the default initial states with critical states identified through explanation methods, thereby encouraging the agent to explore from the mixed initial states. Through careful design, we can theoretically guarantee that our refining scheme has a tighter sub-optimality bound. We evaluate RICE in various popular RL environments and real-world applications. The results demonstrate that RICE significantly outperforms existing refining schemes in enhancing agent performance.
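The mixed initial state distribution described above can be sketched as a simple two-way sampler. This is an illustrative sketch, not the authors' implementation: the function name `sample_initial_state`, the mixing probability `mix_prob`, and the uniform choice among critical states are all assumptions, and it presumes a simulator that can be reset to an arbitrary stored state.

```python
import random
from typing import Callable, Sequence, TypeVar

State = TypeVar("State")


def sample_initial_state(
    default_reset: Callable[[], State],
    critical_states: Sequence[State],
    mix_prob: float,
    rng: random.Random,
) -> State:
    """Draw one start state from a mixture of default and critical states.

    With probability `mix_prob` (a hypothetical hyperparameter) the episode
    starts from a stored critical state; otherwise it starts from the
    environment's default initial state distribution.
    """
    if not 0.0 <= mix_prob <= 1.0:
        raise ValueError("mix_prob must lie in [0, 1]")
    # Fall back to the default reset when no critical states are available.
    if critical_states and rng.random() < mix_prob:
        return rng.choice(critical_states)
    return default_reset()


if __name__ == "__main__":
    rng = random.Random(0)
    starts = [
        sample_initial_state(lambda: "default", ["critical_a", "critical_b"], 0.5, rng)
        for _ in range(10)
    ]
    print(starts)
```

Keeping some probability mass on the default initial states is what the summary credits with mitigating overfitting to the critical states alone.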

Paper Structure (31 sections, 5 theorems, 20 equations, 15 figures, 7 tables, 2 algorithms)

Key Result

Theorem 3.3

Under the paper's assumption referenced as `assumption:random` (the cross-reference is unresolved in this extract), $\eta(\bar{\pi})$ is upper-bounded by $\eta(\pi)$: $\eta(\bar{\pi}) \leq \eta(\pi)$.

Figures (15)

  • Figure 1: Given a pre-trained DRL policy that is not fully optimal (a), we propose the RICE algorithm that resets the RL agent to specific visited states (a mixture of default initial states and identified critical states) (b), followed by an exploration step initiated from these chosen states (c).
  • Figure 2: Agent refining performance in two sparse MuJoCo games. For group (a), we fix the explanation method to ours (the mask network) where one is needed and vary the refining method. For group (b), we fix the refining method to ours and vary the explanation method.
  • Figure 3: SAC agent refining performance in the Hopper game. The left panel shows the training curve for obtaining a pre-trained policy with the SAC algorithm; the right panel shows the refining curves of the different methods.
  • Figure 4: Visualization of state occupancy measures with respect to different policies and the reward function in a 2-state MDP.
  • Figure 5: Fidelity scores for explanation generated by baseline methods and our proposed explanation method. Note that a higher score implies higher fidelity.
  • ...and 10 more figures

Theorems & Definitions (9)

  • Theorem 3.3
  • Lemma 3.5
  • Theorem 3.6
  • Claim 1
  • Theorem 2.1: Fact 1 (Eysenbach et al., 2021)
  • proof
  • Proposition 2.2
  • proof
  • proof