Hierarchical Universal Value Function Approximators

Rushiv Arora

TL;DR

This work develops supervised and reinforcement learning methods for learning embeddings of the states, goals, options, and actions in the two hierarchical value functions: $Q(s, g, o; \theta)$ and $Q(s, g, o, a; \theta)$.

Abstract

There have been key advancements in building universal approximators for multi-goal collections of reinforcement learning value functions, which are key elements in estimating long-term returns of states in a parameterized manner. We extend this to hierarchical reinforcement learning, using the options framework, by introducing hierarchical universal value function approximators (H-UVFAs). This allows us to leverage the added benefits of scaling, planning, and generalization expected in temporal abstraction settings. We develop supervised and reinforcement learning methods for learning embeddings of the states, goals, options, and actions in the two hierarchical value functions: $Q(s, g, o; \theta)$ and $Q(s, g, o, a; \theta)$. Finally, we demonstrate generalization of the H-UVFAs and show that they outperform corresponding UVFAs.
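As a rough illustration of what per-stream embeddings for the two value functions could look like, here is a minimal sketch in Python. The abstract does not say how the embeddings are combined, so the elementwise multilinear product used here is an assumption, as are the table sizes, the rank, and the names `phi`, `psi`, `chi`, `omega`, `q_meta`, `q_intra`, and `greedy_option`. Random tables stand in for learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes and embedding dimension (not from the paper).
n_states, n_goals, n_options, n_actions = 6, 4, 3, 2
rank = 5

# One embedding table per stream; random stand-ins for learned embeddings.
phi = rng.normal(size=(n_states, rank))     # state stream
psi = rng.normal(size=(n_goals, rank))      # goal stream
chi = rng.normal(size=(n_options, rank))    # option stream
omega = rng.normal(size=(n_actions, rank))  # action stream


def q_meta(s, g, o):
    """Three-stream value Q(s, g, o): assumed multilinear product of embeddings."""
    return float(np.sum(phi[s] * psi[g] * chi[o]))


def q_intra(s, g, o, a):
    """Four-stream value Q(s, g, o, a): same assumption with an action stream."""
    return float(np.sum(phi[s] * psi[g] * chi[o] * omega[a]))


def greedy_option(s, g):
    """Pick the option with the highest three-stream value for (s, g)."""
    return int(np.argmax([q_meta(s, g, o) for o in range(n_options)]))


print(q_meta(1, 2, 0), q_intra(1, 2, 0, 1), greedy_option(1, 2))
```

Under this assumption, adding a level of hierarchy adds one more embedding table and one more factor in the product, which is one way to read the "more streams" remark in the Figure 1 caption.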

Paper Structure

This paper contains 16 sections, 14 equations, 11 figures, 1 algorithm.

Figures (11)

  • Figure 1: The multi-stream architecture of H-UVFAs. In different temporal abstraction methods with more hierarchies, there will be more streams. This will result in high-order dimensionality reductions, which our method can accommodate as we will see in the upcoming sections. Left: The three-stream architecture of the policy-over-options (meta-policy) H-UVFA. The streams are: states, goals, options. Right: The four-stream architecture of the intra-option policy H-UVFA. The streams are: states, goals, options, actions.
  • Figure 2: A comparison of the values and actions of the ground truth and H-UVFAs. Red represents higher state values, blue represents lower state values, and the orange circle represents the goal state. The arrows represent the greedy action in each state obtained from picking a greedy option. It is interesting to note that using neural networks/function approximators (H-UVFAs) allows for gradual and smooth changes in the states' values as compared to the more drastic ones seen in the discrete and tabular ground truth. This better represents policies and their values and allows for better agent behaviors in long trajectories where the agent is far from the goal.
  • Figure 3: Comparison of the ground truth, H-UVFAs, and UVFAs on goals seen during training, measured as averages and standard deviations over 10 episodes for 5 goals. H-UVFAs and ground truth are comparable in performance, while UVFAs have poor performance and higher variance. The poor performance of UVFAs as compared to H-UVFAs in hierarchical settings is due to the loss of information when a large amount of information is compressed into a compact representation, as discussed in the section comparing H-UVFAs with UVFAs.
  • Figure 4: Comparison of H-UVFAs and the UVFAs baseline in generalization to unseen goals as measured in average steps to goal and standard deviation over 10 episodes each for 3 goals. H-UVFAs generalize well in hierarchical settings to unseen goals. The variance for both methods increases as compared to that on trained goals.
  • Figure 5: Values and policies for unseen goals in the fourth room. Red indicates higher values while the arrows indicate the greedy action for the greedy policy. The near-optimal value functions and policies indicate that H-UVFAs can extrapolate to create optimal hierarchical behaviors.
  • ...and 6 more figures