Environment Agnostic Goal-Conditioning, A Study of Reward-Free Autonomous Learning

Hampus Åström, Elin Anna Topp, Jacek Malec

TL;DR

It is shown that an agent can learn to solve tasks by selecting its own goals in an environment-agnostic way, at training times comparable to externally guided reinforcement learning, independent of the underlying off-policy learning algorithm.

Abstract

In this paper we study how transforming regular reinforcement learning environments into goal-conditioned environments can let agents learn to solve tasks autonomously and reward-free. We show that an agent can learn to solve tasks by selecting its own goals in an environment-agnostic way, at training times comparable to externally guided reinforcement learning. Our method is independent of the underlying off-policy learning algorithm. Since our method is environment-agnostic, the agent does not value any goals higher than others, leading to instability in performance for individual goals. However, in our experiments, we show that the average goal success rate improves and stabilizes. An agent trained with this method can be instructed to seek any observations made in the environment, enabling generic training of agents prior to specific use cases.
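The transformation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy `ChainEnv`, the `GoalConditionedWrapper` class, and the uniform sampling of goals from previously seen observations are all assumptions made for illustration. The sketch only shows the general idea of discarding the external reward and rewarding the agent for reaching a self-selected observation.

```python
import random
from typing import Hashable, List, Optional, Tuple


class ChainEnv:
    """Hypothetical toy environment: positions 0..n-1 on a line."""

    def __init__(self, n: int = 5) -> None:
        self.n = n
        self.pos = 0

    def reset(self) -> int:
        self.pos = 0
        return self.pos

    def step(self, action: int) -> Tuple[int, float, bool]:
        # Action 1 moves right, any other action moves left.
        delta = 1 if action == 1 else -1
        self.pos = min(self.n - 1, max(0, self.pos + delta))
        done = self.pos == self.n - 1
        # External reward is given only at the right end of the chain.
        return self.pos, (1.0 if done else 0.0), done


class GoalConditionedWrapper:
    """Sketch: replaces the external reward with a goal-reached signal."""

    def __init__(self, env: ChainEnv, seed: int = 0) -> None:
        self.env = env
        self.rng = random.Random(seed)
        self.seen: List[Hashable] = []  # observations encountered so far
        self.goal: Optional[Hashable] = None

    def _remember(self, obs: Hashable) -> None:
        if obs not in self.seen:
            self.seen.append(obs)

    def reset(self, goal: Optional[Hashable] = None):
        obs = self.env.reset()
        self._remember(obs)
        # Assumption: goals are drawn uniformly from seen observations,
        # so no goal is valued higher than any other.
        self.goal = goal if goal is not None else self.rng.choice(self.seen)
        return obs, self.goal

    def step(self, action: int):
        # The external reward is read but deliberately ignored.
        obs, _external_reward, env_done = self.env.step(action)
        self._remember(obs)
        reached = obs == self.goal
        return (obs, self.goal), (1.0 if reached else 0.0), reached or env_done
```

After training, the same wrapper can be given an explicit target, for example `env.reset(goal=3)`, which corresponds to instructing the agent to seek a specific observation made in the environment. Any off-policy learner could consume the `(observation, goal)` pairs; the choice of learner is left open here.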

Paper Structure

This paper contains 21 sections, 2 equations, 6 figures.

Figures (6)

  • Figure 1: The Pathological Mountain Car environment, visualized in \ref{fig:pmc_img}. An adaptation of the Mountain Car environment from OpenAI Gym (Brockman et al., 2016), with an additional goal and a shift that makes one hill steeper (with higher external reward) and the other slightly flatter. \ref{fig:pmc_diff} shows the difference in inclination between our implementation (orange) and the original Mountain Car (blue).
  • Figure 2: Cliff Walker environment: evaluation reward on a symlog scale, \ref{fig:cliff_reward}, as a function of training steps, and optimal behavior rate, \ref{fig:cliff_optimum}, with 8 experiments for each method. Takeaway: Our solution reaches the optimal policy more quickly.
  • Figure 3: Evaluation reward for Frozen Lake, as a function of training steps, with 4 experiments for each method. Takeaway: Despite being reward-free, our method achieves comparable results with all selection methods except intermediate difficulty, but both our methods and the baseline perform worse than an oracle (with $\approx 0.7$ average reward).
  • Figure 4: Evaluation reward, \ref{fig:pmc_reward}, and rate of success for reaching the hard goal, \ref{fig:pmc_hardest}, with 8 experiments for each method on Pathological Mountain Car, as a function of training steps. When evaluating goal methods, the hard hill is given as the target goal. Takeaway: Our method reaches the hard goal faster than the baseline, but does not retain the ability to reach it consistently.
  • Figure 5: The average goal success rate for our methods on Frozen Lake, \ref{fig:frozen_goal}, and an example of how success rates vary across goals when training an agent, \ref{fig:frozen_goals_in_one_run}. In \ref{fig:frozen_goals_in_one_run}, goal 15 has the external reward (though our agent is unaware of that). Takeaway: The average stabilizes quickly, but individual goal performance fluctuates wildly, though this could in part be due to the stochasticity of the environment.
  • ...and 1 more figure