Relating Reinforcement Learning to Dynamic Programming-Based Planning

Filip V. Georgiev; Kalle G. Timperi; Başak Sakçak; Steven M. LaValle

TL;DR

This paper bridges some of the gap between optimal planning and reinforcement learning (RL). It compares value iteration with RL under planning-oriented criteria, across learning rates and greediness factors, and advocates defining and optimizing a true cost rather than inserting arbitrary parameters.

Abstract

This paper bridges some of the gap between optimal planning and reinforcement learning (RL), both of which share roots in dynamic programming applied to sequential decision making or optimal control. Whereas planning typically favors deterministic models, goal termination, and cost minimization, RL tends to favor stochastic models, infinite-horizon discounting, and reward maximization, in addition to learning-related parameters such as the learning rate and greediness factor. A derandomized version of RL is developed, analyzed, and implemented to yield performance comparisons with value iteration and Dijkstra's algorithm using simple planning models. Next, mathematical analysis shows: 1) conditions under which cost minimization and reward maximization are equivalent, 2) conditions for equivalence of single-shot goal termination and infinite-horizon episodic learning, and 3) conditions under which discounting causes goal achievement to fail. The paper then advocates for defining and optimizing a true cost, rather than inserting arbitrary parameters to guide operations. Performance studies are then extended to the stochastic case, using planning-oriented criteria and comparing value iteration to RL with learning rates and greediness factors.
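To make the deterministic comparison concrete, the following is a minimal sketch of value iteration and Dijkstra's algorithm computing the same optimal cost-to-go on a toy model. The four-state model `MODEL`, its action names, and its step costs are hypothetical and are not taken from the paper's benchmark problems; the function names are likewise chosen here for illustration.

```python
import heapq

INF = float("inf")

# Hypothetical toy model (an assumption, not from the paper):
# state -> {action: (next_state, step_cost)}
MODEL = {
    "a": {"right": ("b", 1.0), "down": ("c", 4.0)},
    "b": {"down": ("goal", 5.0), "left": ("a", 1.0), "diag": ("c", 1.0)},
    "c": {"right": ("goal", 1.0)},
    "goal": {},
}
GOAL = "goal"


def value_iteration(model, goal):
    """Sweep all states until the cost-to-go estimates stop changing."""
    cost = {x: INF for x in model}
    cost[goal] = 0.0  # goal termination: no further cost accrues
    changed = True
    while changed:
        changed = False
        for x, actions in model.items():
            if x == goal:
                continue
            # Undiscounted cost minimization over the available actions.
            best = min((c + cost[nx] for nx, c in actions.values()), default=INF)
            if best < cost[x]:
                cost[x] = best
                changed = True
    return cost


def dijkstra_to_goal(model, goal):
    """Backward Dijkstra search from the goal over reversed edges."""
    reverse = {x: [] for x in model}
    for x, actions in model.items():
        for nx, c in actions.values():
            reverse[nx].append((x, c))
    dist = {x: INF for x in model}
    dist[goal] = 0.0
    heap = [(0.0, goal)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist[x]:
            continue  # stale queue entry
        for px, c in reverse[x]:
            if d + c < dist[px]:
                dist[px] = d + c
                heapq.heappush(heap, (dist[px], px))
    return dist


vi_cost = value_iteration(MODEL, GOAL)
dj_cost = dijkstra_to_goal(MODEL, GOAL)
```

On a model like this, with nonnegative step costs and a reachable goal, both routines should agree on every state; Dijkstra's algorithm touches each state once, whereas value iteration may need several sweeps.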

Paper Structure (18 sections, 4 theorems, 21 equations, 8 figures, 50 tables)

Introduction
Optimal Planning and Deterministic RL
Optimal planning concepts
Handling imperfect models by derandomizing RL
Computational comparisons
Analysis of Cost/Reward Models
Cost and reward models are equivalent, almost
The dangers of discounting
Episodic equivalences
From Deterministic to Stochastic Models
Basic assumptions and approach
Computational comparisons
Conclusions
Proofs of Propositions
Problems
...and 3 more sections

Key Result

proposition 1

If every state-action pair $(x,u)$ is visited infinitely often for all $x \in X$ and $u \in U(x)$, and the update (eqn:qvalitdet) is applied in every step, then after a finite number of iterations, $\hat{Q}^* = Q^*$.
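The finite-time convergence claimed in Proposition 1 can be illustrated with a small sketch. The referenced update is not reproduced in this summary, so the code assumes a common derandomized form: learning rate 1, no discounting, and $\hat{Q}(x,u) \leftarrow l(x,u) + \min_{u'} \hat{Q}(f(x,u),u')$, with zero cost-to-go at the goal and $\hat{Q}$ initialized to zero. The toy model, the round-robin visiting order, and all names are hypothetical.

```python
# Hypothetical toy model (an assumption, not from the paper):
# state -> {action: (next_state, step_cost)}
MODEL = {
    "a": {"right": ("b", 1.0), "down": ("c", 4.0)},
    "b": {"down": ("goal", 5.0), "left": ("a", 1.0), "diag": ("c", 1.0)},
    "c": {"right": ("goal", 1.0)},
    "goal": {},
}
GOAL = "goal"


def deterministic_q_learning(model, goal, max_sweeps=1000):
    """Assumed deterministic Q-update, applied to every (x, u) pair in turn."""
    # Zero initialization is an assumption made for this sketch.
    q = {(x, u): 0.0 for x, actions in model.items() for u in actions}

    def cost_to_go(x):
        # The goal is terminal; elsewhere take the best current estimate.
        if x == goal:
            return 0.0
        return min(q[(x, u)] for u in model[x])

    sweeps = 0
    for _ in range(max_sweeps):
        sweeps += 1
        changed = False
        # Round-robin sweeps stand in for "every pair visited infinitely often".
        for x, actions in model.items():
            for u, (nx, step_cost) in actions.items():
                new_value = step_cost + cost_to_go(nx)
                if new_value != q[(x, u)]:
                    q[(x, u)] = new_value
                    changed = True
        if not changed:
            break  # a full sweep with no change: the estimates are a fixed point
    return q, sweeps


q_hat, sweeps_used = deterministic_q_learning(MODEL, GOAL)
```

Under these assumptions the estimates stop changing after a handful of sweeps, which matches the spirit of the proposition; a different initialization or visiting order would change the number of sweeps but not the fixed point on this model.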

Figures (8)

Figure 1: Ten representative problems (enumerated from top-left to bottom-right as 0, 1, 3, 4, 8, 9, 10, 11, 12, 16). The green dot is the initial state. The red dot is the goal state. The white segments connect the states and can be viewed as the actions allowed by the robot. All actions have the same step cost. The gray areas are obstacles.
Figure 2: The effect of decreasing greediness in a policy when applied on different problems.
Figure 3: The difference between a single-trial (unspecified-horizon) formulation and an infinite-horizon formulation. The optimal path for a single trial is indicated with black arrows going from $x_I$ to $x_g$. A more costly, but shorter, path is indicated with red arrows. In the infinite-horizon formulation, a shortcut (dashed arrow) is introduced that takes the robot from $x_g$ and returns it to $x_I$. The dark blue squares indicate a region in state space where the robot could loop indefinitely with negligible average cost per step without ever reaching the goal.
Figure 4: Ten problems (enumerated from top-left to bottom-right as 16, 8, 0, 1, 2, 3, 4, 5, 7, 9). The green dot is the initial state, while the red dot is the goal state. The white lines connect the states and can be viewed as the actions allowed by the robot. The gray circles represent obstacles that cannot be passed through.
Figure 5: Six additional problems (enumerated from top-left to bottom-right as 10 to 15).
...and 3 more figures

Theorems & Definitions (8)

proposition 1
proof
proposition 2
proof
proposition 3
proposition 4
proof
proof
