Investigating Intra-Abstraction Policies For Non-exact Abstraction Algorithms

Robin Schmöcker, Alexander Dockhorn, Bodo Rosenhahn

TL;DR

This paper proposes and empirically evaluates several alternative intra-abstraction policies, several of which outperform the random policy across a majority of environments and parameter settings.

Abstract

One weakness of Monte Carlo Tree Search (MCTS) is its low sample efficiency, which can be addressed by building and using state and/or action abstractions in parallel to the tree search so that information can be shared among nodes of the same layer. Abstractions are primarily used in MCTS to enhance the Upper Confidence Bound (UCB) value during the tree policy by aggregating the visits and returns of an abstract node. However, this direct usage does not account for the case where multiple actions with the same parent are in the same abstract node: these actions then all have the same UCB value, so a tiebreak rule is required. In state-of-the-art abstraction algorithms such as pruned On the Go Abstractions (pruned OGA), this case has gone unnoticed, and a random tiebreak rule was implicitly chosen. In this paper, we propose and empirically evaluate several alternative intra-abstraction policies, several of which outperform the random policy across a majority of environments and parameter settings.
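To make the tie concrete, the following is a minimal sketch of a UCB computation that uses the aggregated visits and returns of each abstract node. The function name `abstract_ucb`, the action labels, the statistics, and the exploration constant are illustrative assumptions, not values from the paper.

```python
import math
from collections import defaultdict


def abstract_ucb(parent_visits, stats, abstraction, c=math.sqrt(2)):
    """Return one UCB value per ground action, using abstract-node totals.

    stats:       ground action -> (visits, summed return)   (hypothetical data)
    abstraction: ground action -> abstract node id           (hypothetical data)
    """
    agg_visits = defaultdict(int)
    agg_return = defaultdict(float)
    for action, (n, ret) in stats.items():
        agg_visits[abstraction[action]] += n
        agg_return[abstraction[action]] += ret

    values = {}
    for action in stats:
        n = agg_visits[abstraction[action]]
        if n == 0:
            values[action] = math.inf  # unvisited abstract nodes are tried first
        else:
            mean = agg_return[abstraction[action]] / n
            values[action] = mean + c * math.sqrt(math.log(parent_visits) / n)
    return values


# Actions "a" and "b" share abstract node "X"; "c" is alone in "Y".
stats = {"a": (10, 4.0), "b": (30, 21.0), "c": (20, 6.0)}
abstraction = {"a": "X", "b": "X", "c": "Y"}
parent_visits = sum(n for n, _ in stats.values())
ucb = abstract_ucb(parent_visits, stats, abstraction)
print(ucb)
```

Although "a" and "b" have different ground statistics, they receive the same value because only the totals of "X" enter the formula. Choosing between them is the job of the intra-abstraction policy.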

Paper Structure

This paper contains 19 sections, 8 equations, 33 figures, 4 tables.

Figures (33)

  • Figure 1: Assume that MCTS with the visualized fixed action abstraction is run on the following deterministic depth-1 game tree. This abstraction would also be discovered by $(0.1,0)$-OGA, since once all four actions have been played at least $K$ times, $(0.1,0)$-OGA will have grouped actions 1 and 2 into one abstract node and actions 3 and 4 into another. For both the RANDOM and the UCT intra-abstraction policies, the visits converge to the abstract node containing actions 3 and 4. However, RANDOM distributes its visits uniformly between actions 3 and 4, resulting in an average payoff of $1.05$. This shows that with RANDOM, convergence to the optimal action is in general not guaranteed. In contrast, the UCT intra-abstraction policy guarantees convergence to the average payoff of $1.1$ by converging to action 4.
  • Figure 2: The pairings and relative improvement score for all iteration budgets combined of the best performing parameter-combination of each intra-abstraction policy are shown.
  • Figure 4: A showcase of how the ASAP abstraction framework, which itself is a special case of ASASAP abstractions, would detect equivalences in the following 5-state MDP. Each node represents a state, and arrows represent deterministic actions with the same immediate reward of 0. The dotted ovals represent abstractions. Initially, in (a), all states and state-action pairs are in their own singleton abstract node. Then, in (b), the next state-action pair abstraction is constructed from this initial state abstraction (the application of function $f$ from Section \ref{sec:foundations}), which groups the actions of nodes 3 and 4 because they have the same immediate reward and the same transition distribution. From this state-action pair abstraction, the next state abstraction is constructed in (c) (the application of function $g$ from Section \ref{sec:foundations}), which groups nodes 3 and 4 because they have the same set of abstract state-action pairs. Then, in (d), the next state-action pair abstraction is constructed, which also groups the actions from nodes 1 and 2 because they have the same abstract successor. A state abstraction is constructed again in (e), which groups states 1 and 2. Further applications of $f$ or $g$ have no effect, hence the abstraction has converged.
  • Figure : (a) Academic Advising
  • Figure : (a) 100 iterations
  • ...and 28 more figures
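The Figure 1 argument can be checked with a small simulation. This is a rough sketch under stated assumptions: the payoffs of actions 3 and 4 (1.0 and 1.1) follow from the averages quoted in the caption, while the payoffs of actions 1 and 2, the exploration constant, the iteration budget, and the reading of the UCT intra-abstraction policy as UCB1 over the ground statistics inside the selected abstract node are all assumptions made for illustration.

```python
import math
import random

# Payoffs for 3 and 4 are inferred from the caption; 1 and 2 are hypothetical.
PAYOFF = {1: 0.0, 2: 0.1, 3: 1.0, 4: 1.1}
ABSTRACTION = {"low": [1, 2], "high": [3, 4]}  # fixed action abstraction


def ucb(mean, n, total, c):
    """UCB1 score; unvisited entries get infinite priority."""
    return math.inf if n == 0 else mean + c * math.sqrt(math.log(total) / n)


def run(intra_policy, iterations=20_000, c=1.0, seed=0):
    """Average payoff of abstract-level UCB with the given intra-abstraction policy."""
    rng = random.Random(seed)
    visits = {a: 0 for a in PAYOFF}
    returns = {a: 0.0 for a in PAYOFF}

    def stats_of(actions):
        n = sum(visits[a] for a in actions)
        total_return = sum(returns[a] for a in actions)
        return (total_return / n if n else 0.0), n

    for t in range(1, iterations + 1):
        # Tree policy on abstract nodes: aggregated visits and returns.
        node = max(ABSTRACTION, key=lambda g: ucb(*stats_of(ABSTRACTION[g]), t, c))
        members = ABSTRACTION[node]
        if intra_policy == "random":
            action = rng.choice(members)  # uniform tiebreak
        else:
            # Assumed UCT variant: UCB1 on ground statistics within the node.
            _, n_node = stats_of(members)
            action = max(members, key=lambda a: ucb(*stats_of([a]), n_node + 1, c))
        visits[action] += 1
        returns[action] += PAYOFF[action]

    return sum(returns.values()) / iterations


random_avg = run("random")
uct_avg = run("uct")
print(f"RANDOM: {random_avg:.3f}  UCT: {uct_avg:.3f}")
```

Under these assumptions the RANDOM average should settle near 1.05, while the UCT variant should approach 1.1 as the budget grows, since visits to action 3 only grow logarithmically.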
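The alternating refinement described in the Figure 4 caption can be sketched as a fixed-point iteration. The exact edges of the 5-state MDP are not given in the text, so the topology below (1→3, 2→4, 3→5, 4→5, with state 5 terminal) is an assumption chosen to match the described groupings. The functions `f` and `g` are simplified deterministic stand-ins for the paper's functions of the same name, and `canonical` is a hypothetical helper.

```python
# Assumed topology: state -> list of (immediate reward, next state).
MDP = {1: [(0, 3)], 2: [(0, 4)], 3: [(0, 5)], 4: [(0, 5)], 5: []}


def canonical(labels):
    """Relabel each block by its smallest member so partitions can be compared."""
    blocks = {}
    for item, label in labels.items():
        blocks.setdefault(label, []).append(item)
    return {item: min(members) for members in blocks.values() for item in members}


def f(state_abs):
    """State-action pairs are equivalent if reward and abstract successor match."""
    return canonical({
        (s, i): (reward, state_abs[nxt])
        for s, actions in MDP.items()
        for i, (reward, nxt) in enumerate(actions)
    })


def g(pair_abs):
    """States are equivalent if they have the same set of abstract pairs."""
    return canonical({
        s: frozenset(pair_abs[(s, i)] for i in range(len(MDP[s])))
        for s in MDP
    })


def refine_until_converged():
    state_abs = {s: s for s in MDP}  # step (a): singleton abstract nodes
    steps = []
    while True:
        pair_abs = f(state_abs)      # steps (b) and (d)
        new_state_abs = g(pair_abs)  # steps (c) and (e)
        steps.append((pair_abs, new_state_abs))
        if new_state_abs == state_abs:
            return state_abs, steps
        state_abs = new_state_abs


final_state_abs, steps = refine_until_converged()
print(final_state_abs)
```

In this sketch each state maps to the smallest state in its block, so states 3 and 4 share a label after the first round and states 1 and 2 after the second. A third round changes nothing and ends the loop.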