Match or Replay: Self Imitating Proximal Policy Optimization

Gaurav Chaudhary, Laxmidhar Behera, Washim Uddin Mondal

Abstract

Reinforcement Learning (RL) agents often struggle with inefficient exploration, particularly in environments with sparse rewards. Traditional exploration strategies can lead to slow learning and suboptimal performance because agents fail to systematically build on previously successful experiences, thereby reducing sample efficiency. To tackle this issue, we propose a self-imitating on-policy algorithm that enhances exploration and sample efficiency by leveraging past high-reward state-action pairs to guide policy updates. Our method incorporates self-imitation by using the optimal transport distance in dense-reward environments to prioritize state-visitation distributions that match the most rewarding trajectory. In sparse-reward environments, we uniformly replay successful self-encountered trajectories to facilitate structured exploration. Experimental results across diverse environments, including MuJoCo for dense rewards and the partially observable 3D Animal-AI Olympics and multi-goal PointMaze for sparse rewards, demonstrate substantial improvements in learning efficiency. Our approach achieves faster convergence and significantly higher success rates compared to state-of-the-art self-imitating RL baselines. These findings underscore the potential of self-imitation as a robust strategy for enhancing exploration in RL, with applicability to more complex tasks.
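
The dense-reward matching idea can be made concrete with a small sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it treats the state visitations of the current rollout and of the best trajectory so far as uniform empirical distributions, uses a squared-Euclidean transport cost, and approximates the optimal transport distance with Sinkhorn iterations; the names `sinkhorn_distance` and `imitation_bonus` are hypothetical.

```python
import numpy as np

def sinkhorn_distance(X, Y, reg=0.1, n_iters=100):
    """Entropy-regularized optimal transport (Sinkhorn) cost between two
    state sets X (n x d) and Y (m x d), each treated as a uniform
    empirical state-visitation distribution (illustrative sketch)."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    n, m = len(X), len(Y)
    # Pairwise squared-Euclidean cost between visited states.
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) ** 2
    K = np.exp(-C / reg)                      # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u = np.ones(n)
    for _ in range(n_iters):                  # Sinkhorn fixed-point updates
        v = b / (K.T @ u + 1e-9)
        u = a / (K @ v + 1e-9)
    P = u[:, None] * K * v[None, :]           # Approximate transport plan
    return float(np.sum(P * C))               # Approximate OT cost

def imitation_bonus(current_states, best_states, scale=1.0):
    """Hypothetical shaping term: reward rollouts whose state visitation is
    close (in OT distance) to the most rewarding trajectory seen so far."""
    return -scale * sinkhorn_distance(current_states, best_states)
```

In a PPO-style update, such a term could be mixed into the rollout reward or advantage to bias the policy toward revisiting high-reward state distributions; the sparse-reward branch (uniform replay of successful self-encountered trajectories) would instead add stored successful rollouts to the update batch and is omitted from this sketch.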

Paper Structure

This paper contains 17 sections, 10 equations, 11 figures, 3 tables, and 2 algorithms.

Figures (11)

  • Figure 1: Results show the performance on 8 MuJoCo [towers_gymnasium_2023] continuous control tasks (refer to Figure \ref{fig:SIPP-match} for results on all tasks). The plots show learning curves, with episodic rewards on the y-axis, evaluated under the current policy. The reported results are the mean across seven seeds, with shaded regions indicating the standard deviation. The proposed algorithms achieve competitive or better performance than all baselines across all tasks.
  • Figure 2: All tasks feature one goal and one agent. The agent's and goal's positions are randomly selected at the start of each episode from a predefined set of fixed initial positions. Each episode initializes the environment by sampling these positions, ensuring variability while maintaining a structured distribution. There is only one source of reward per environment, i.e., a binary reward is provided for reaching the goal. The agent observes the arena through a first-person view with partial visibility, reflecting the limitations of a partially observable environment.
  • Figure 3: Results show the performance on 4 PointMaze [gymnasium_robotics2023github] multi-goal sparse-reward tasks (refer to Figure \ref{fig:maze_result} for results on all tasks). The plots show learning curves, with episodic rewards on the y-axis, evaluated under the current policy. The reported results are the mean across seven seeds. The proposed algorithms outperform all baselines by a significant margin.
  • Figure 4: Results show the performance on 4 Animal-AI Olympics [crosby2019animal] binary-reward tasks (refer to Figure \ref{fig:animal} for results on all tasks). The plots show learning curves, with episodic rewards (success rate) on the y-axis, evaluated under the current policy. The reported results are the mean across five seeds, with shaded regions highlighting the standard deviation. The proposed algorithms outperform PPO by a significant margin.
  • Figure 5: Results show the ablation study on 4 MuJoCo [towers_gymnasium_2023] continuous control tasks (refer to Figure \ref{fig:match_abe} for complete results). The parameter $\epsilon$ controls the balance between exploration and exploitation. The plots show learning curves for different values of $\epsilon$, with episodic rewards on the y-axis, evaluated under the current policy. The reported results are the mean across five seeds, with shaded regions highlighting the standard deviation.
  • ...and 6 more figures