Relating Reinforcement Learning to Dynamic Programming-Based Planning

Filip V. Georgiev, Kalle G. Timperi, Başak Sakçak, Steven M. LaValle

TL;DR

This paper bridges some of the gap between optimal planning and reinforcement learning (RL). It compares value iteration with RL under planning-oriented criteria, across learning rates and greediness factors, and advocates defining and optimizing the true cost rather than inserting arbitrary parameters.

Abstract

This paper bridges some of the gap between optimal planning and reinforcement learning (RL), both of which share roots in dynamic programming applied to sequential decision making or optimal control. Whereas planning typically favors deterministic models, goal termination, and cost minimization, RL tends to favor stochastic models, infinite-horizon discounting, and reward maximization, in addition to learning-related parameters such as the learning rate and greediness factor. A derandomized version of RL is developed, analyzed, and implemented to yield performance comparisons with value iteration and Dijkstra's algorithm using simple planning models. Next, mathematical analysis shows: 1) conditions under which cost minimization and reward maximization are equivalent, 2) conditions for equivalence of single-shot goal termination and infinite-horizon episodic learning, and 3) conditions under which discounting causes goal achievement to fail. The paper then advocates for defining and optimizing the true cost, rather than inserting arbitrary parameters to guide operations. Performance studies are then extended to the stochastic case, using planning-oriented criteria and comparing value iteration to RL with learning rates and greediness factors.

Paper Structure (18 sections, 4 theorems, 21 equations, 8 figures, 50 tables)

Key Result

proposition 1

If every state-action pair $(x,u)$, with $x \in X$ and $u \in U(x)$, is visited infinitely often and the update (eqn:qvalitdet) is applied at every step, then $\hat{Q}^* = Q^*$ after a finite number of iterations.
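
The finite-convergence claim can be illustrated with a minimal sketch. The update referenced as (eqn:qvalitdet) is not reproduced in this summary, so the code assumes a common deterministic form, $\hat{Q}(x,u) \leftarrow l(x,u) + \min_{u'} \hat{Q}(f(x,u),u')$, with zero cost-to-go at the goal, no discounting, and no learning rate. The four-state model, the names `EDGES`, `best_cost`, and `q_sweeps`, and the fixed sweep order are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical deterministic model (not from the paper).
# EDGES[x][u] = (next_state, step_cost)
EDGES = {
    "a": {"right": ("b", 1.0), "down": ("c", 4.0)},
    "b": {"right": ("goal", 5.0), "down": ("c", 1.0)},
    "c": {"right": ("goal", 1.0)},
    "goal": {},
}
GOAL = "goal"


def best_cost(q, x):
    """Smallest Q-value over the actions at x; zero at the goal."""
    if x == GOAL:
        return 0.0
    return min(q[(x, u)] for u in EDGES[x])


def q_sweeps(max_sweeps=100):
    """Visit every (x, u) pair once per sweep, applying the assumed update.

    Returns the Q-table and the index of the first sweep in which no
    value changed.
    """
    # Unknown costs start at infinity (an assumption of this sketch).
    q = {(x, u): math.inf for x in EDGES for u in EDGES[x]}
    for sweep in range(1, max_sweeps + 1):
        changed = False
        for x in EDGES:
            for u, (nxt, cost) in EDGES[x].items():
                new_value = cost + best_cost(q, nxt)
                if new_value != q[(x, u)]:
                    q[(x, u)] = new_value
                    changed = True
        if not changed:
            return q, sweep
    return q, max_sweeps


if __name__ == "__main__":
    q, sweeps = q_sweeps()
    print("sweeps until no change:", sweeps)
    print("optimal cost-to-go from a:", best_cost(q, "a"))
```

Sweeping all pairs repeatedly is one simple way to satisfy the "visited infinitely often" condition; because no learning rate or randomness is involved in this sketch, the table stops changing after a handful of sweeps rather than converging only in the limit.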

Figures (8)

  • Figure 1: Ten representative problems (enumerated from top-left to bottom-right as 0, 1, 3, 4, 8, 9, 10, 11, 12, 16). The green dot is the initial state. The red dot is the goal state. The white segments connect the states and can be viewed as the actions available to the robot. All actions have the same step cost. The gray areas are obstacles.
  • Figure 2: The effect of decreasing greediness in a policy when applied to different problems.
  • Figure 3: The difference between a single-trial (unspecified-horizon) formulation and an infinite-horizon formulation. The optimal path for a single trial is indicated with black arrows going from $x_I$ to $x_g$. A more costly, but shorter, path is indicated with red arrows. In the infinite-horizon formulation, a shortcut (dashed arrow) is introduced that takes the robot from $x_g$ and returns it to $x_I$. The dark blue squares indicate a region in state space where the robot could loop indefinitely with negligible average cost per step without ever reaching the goal.
  • Figure 4: Ten problems (enumerated from top-left to bottom-right as 16, 8, 0, 1, 2, 3, 4, 5, 7, 9). The green dot is the initial state, while the red dot is the goal state. The white lines connect the states and can be viewed as the actions available to the robot. The gray circles represent obstacles that cannot be passed through.
  • Figure 5: Six additional problems (enumerated from top-left to bottom-right as 10-15).
  • ...and 3 more figures
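
The looping region described in the Figure 3 caption has a simple analogue under discounting, which the abstract lists as a setting where goal achievement can fail. The following is a minimal sketch of that arithmetic, not the paper's analysis: all costs, the discount factors, and the function names are hypothetical. It compares the discounted cost of a finite path to the goal with the discounted cost of looping forever at a small per-step cost, which sums to $c/(1-\gamma)$ for $0 \le \gamma < 1$.

```python
def discounted_path_cost(step_costs, gamma):
    """Discounted sum of the step costs along a finite path."""
    return sum(c * gamma**k for k, c in enumerate(step_costs))


def discounted_loop_cost(cost_per_step, gamma):
    """Discounted cost of looping forever: geometric series c / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1) for the series to converge")
    return cost_per_step / (1.0 - gamma)


def prefers_loop(path_costs, loop_cost_per_step, gamma):
    """True if endless looping is cheaper than reaching the goal."""
    loop = discounted_loop_cost(loop_cost_per_step, gamma)
    path = discounted_path_cost(path_costs, gamma)
    return loop < path


if __name__ == "__main__":
    path = [1.0] * 10   # hypothetical: ten unit-cost steps to the goal
    loop_cost = 0.01    # hypothetical: negligible cost per looping step
    for gamma in (0.9, 0.999):
        print(gamma, prefers_loop(path, loop_cost, gamma))
```

With these hypothetical numbers, a discount factor of 0.9 makes endless looping cheaper than reaching the goal, while 0.999 does not; without discounting, any positive per-step cost would make the endless loop infinitely expensive.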

Theorems & Definitions (8)

  • proposition 1
  • proof
  • proposition 2
  • proof
  • proposition 3
  • proposition 4
  • proof
  • proof