Reinforcement Learning with a Focus on Adjusting Policies to Reach Targets
Akane Tsuboya, Yu Kono, Tatsuji Takahashi
TL;DR
This work tackles exploration efficiency in reinforcement learning by reframing exploration as the pursuit of an aspiration level rather than pure return maximization. It introduces Regional Stochastic Risk-sensitive Satisficing (RS^2), a deep-RL extension that uses reliability estimates computed from clusters of state vectors, together with a per-state aspiration meta-mechanism, to govern exploration through a softmax policy. On both a dense-reward task (CartPole) and a sparse-reward task (Pyramid), RS^2 learns rapidly in the early phase and handles non-stationary environments robustly, widening exploration early and narrowing it as learning progresses. The approach is suited to real-world control problems in which a target level of performance must be reached quickly and the environment may change.
Abstract
The objective of a reinforcement learning agent is to discover better actions through exploration. However, typical exploration techniques aim to maximize rewards, often incurring high costs in both exploration and learning. We propose a novel deep reinforcement learning method that prioritizes achieving an aspiration level over maximizing expected return. The method flexibly adjusts the degree of exploration based on the proportion of target achievement. In experiments on a motion control task and a navigation task, it achieved returns equal to or greater than those of standard methods. Our analysis showed that the method flexibly adjusts its exploration scope and that it has the potential to enable the agent to adapt to non-stationary environments. These findings suggest that the method may improve exploration efficiency in practical applications of reinforcement learning.
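To make the idea of aspiration-driven exploration concrete, the following is a minimal tabular sketch, not the RS^2 algorithm itself. It assumes a satisficing value of the form reliability × (Q − aspiration), where reliability is approximated by each action's share of past trials, and it feeds these values into a softmax. The function names (`satisficing_values`, `softmax`, `satisficing_policy`), the use of visit counts as a reliability proxy, and the `temperature` parameter are all illustrative assumptions; the paper's formulation, which estimates reliability from clusters of state vectors, may differ.

```python
import math


def satisficing_values(q_values, visit_counts, aspiration):
    """Assumed satisficing value: reliability_i * (Q_i - aspiration).

    Reliability is approximated here by each action's share of past
    trials (an illustrative assumption, not the paper's estimator).
    """
    total = sum(visit_counts)
    n = len(q_values)
    if total == 0:
        # No experience yet: treat all actions as equally reliable.
        reliabilities = [1.0 / n] * n
    else:
        reliabilities = [c / total for c in visit_counts]
    return [rho * (q - aspiration) for q, rho in zip(q_values, reliabilities)]


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def satisficing_policy(q_values, visit_counts, aspiration, temperature=1.0):
    """Action probabilities from the assumed satisficing values."""
    values = satisficing_values(q_values, visit_counts, aspiration)
    return softmax(values, temperature)


if __name__ == "__main__":
    # Target not yet reached: every Q is below the aspiration level.
    print(satisficing_policy([0.4, 0.2], [9, 1], aspiration=1.0))
    # Target reached: both Q values exceed the aspiration level.
    print(satisficing_policy([1.2, 1.1], [9, 1], aspiration=1.0))
```

Under these assumptions, when no action meets the aspiration level all values are negative, so rarely tried actions (low reliability) sit closer to zero and receive more probability, which widens exploration. Once an action exceeds the aspiration level, the well-tried satisfying action has the largest value and exploration narrows.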
