Reinforcement Learning with a Focus on Adjusting Policies to Reach Targets

Akane Tsuboya, Yu Kono, Tatsuji Takahashi

TL;DR

This work addresses exploration efficiency in reinforcement learning by reframing exploration as the pursuit of an aspiration level rather than pure return maximization. It introduces Regional Stochastic Risk-sensitive Satisficing (RS^2), a deep-RL extension that uses reliability estimates from state-vector clusters and a per-state aspiration meta-mechanism to govern exploration via a softmax policy. On both a dense-reward task (CartPole) and a sparse-reward task (Pyramid), RS^2 achieves returns equal to or greater than standard methods and learns quickly early on, widening exploration at first and narrowing it as learning progresses. The analysis also suggests that RS^2 has the potential to adapt to non-stationary environments. The approach may benefit real-world control problems where reaching a target level of performance quickly matters more than maximizing return.

Abstract

The objective of a reinforcement learning agent is to discover better actions through exploration. However, typical exploration techniques aim to maximize rewards, often incurring high costs in both exploration and learning. We propose a novel deep reinforcement learning method that prioritizes achieving an aspiration level over maximizing expected return. The method flexibly adjusts the degree of exploration based on the proportion of target achievement. In experiments on a motion control task and a navigation task, it achieved returns equal to or greater than those of other standard methods. The analysis showed two things: the method flexibly adjusts the exploration scope, and it has the potential to enable the agent to adapt to non-stationary environments. These findings indicate that the method may improve exploration efficiency in practical applications of reinforcement learning.
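
To make the idea of aspiration-driven exploration concrete, the following is a minimal sketch and not the RS^2 algorithm itself. It assumes a score of the form reliability × (Q − aspiration) fed into a softmax; this form is borrowed from the general satisficing literature and is not stated in this summary. The names `satisficing_policy`, `reliabilities`, and `aspiration` are hypothetical, and the clustering-based reliability estimate and the per-state aspiration mechanism are left out.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def satisficing_policy(q_values, reliabilities, aspiration):
    """Toy aspiration-driven action distribution (illustrative assumption).

    Each action is scored as reliability * (Q - aspiration):
    - If every Q is below the aspiration level, all scores are negative and
      the least-tried (low-reliability) actions score closest to zero, so
      they receive more probability mass (wider exploration).
    - If a well-tried action exceeds the aspiration level, its score is
      positive and it dominates (narrower exploration).
    """
    scores = [rho * (q - aspiration)
              for q, rho in zip(q_values, reliabilities)]
    return softmax(scores)


if __name__ == "__main__":
    q = [0.2, 0.4]      # estimated action values
    rho = [0.1, 0.9]    # share of trials spent on each action
    print(satisficing_policy(q, rho, aspiration=1.0))  # target not yet met
    print(satisficing_policy(q, rho, aspiration=0.3))  # target met by action 1
```

With the unmet target (aspiration 1.0), the rarely tried action 0 is favored even though its estimated value is lower; with the met target (aspiration 0.3), the well-tried action 1 is favored. This mirrors, in a much simplified form, the behavior described above of adjusting exploration according to how much of the target has been achieved.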

Paper Structure

This paper contains 16 sections, 6 equations, 5 figures.

Figures (5)

  • Figure 1: CartPole-v0 [CartPole]
  • Figure 2: Overview of the Pyramid task [Ikeda22]. (a) An example of the Pyramid task with a depth of 6 and a hyperplane coordinate dimensionality of 2. (b) The frequency with which a random agent reaches each terminal state. The white-bordered area represents the set of states that can be reached from all initial states. In this experiment, rewards were assigned in two patterns: the red-bordered state (hard-to-reach) and the orange-bordered state (easy-to-reach).
  • Figure 5: Return achieved by each method in CartPole
  • Figure 6: Return achieved by each algorithm in Pyramid task
  • Figure 9: Exploration tendencies of each agent. (a) Yellow indicates the state with the reward. Dark gray indicates groups of states neighboring the reward. Black indicates groups of states distant from the reward. (b), (c), and (d) show the number of times each agent visited the yellow, dark gray, and black states, respectively.