Understanding the theoretical properties of projected Bellman equation, linear Q-learning, and approximate value iteration

Han-Dong Lim, Donghwan Lee

TL;DR

This paper considers two sufficient conditions for the existence of a solution to the projected Bellman equation (PBE): the strictly negatively row dominating diagonal (SNRDD) assumption and a condition motivated by the convergence of approximate value iteration (AVI).

Abstract

In this paper, we study the theoretical properties of the projected Bellman equation (PBE) and two algorithms for solving it: linear Q-learning and approximate value iteration (AVI). We consider two sufficient conditions for the existence of a solution to the PBE: the strictly negatively row dominating diagonal (SNRDD) assumption and a condition motivated by the convergence of AVI. The SNRDD assumption also ensures the convergence of linear Q-learning, and we examine its relationship with the convergence of AVI. Lastly, we provide several observations on the solution of the PBE under an $\epsilon$-greedy policy.
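To make the AVI iteration concrete, here is a minimal sketch in its textbook form with a greedy target: the parameter vector is repeatedly replaced by the weighted least-squares fit of the one-step Bellman backup, and a fixed point of this map solves the PBE. The function name `avi_linear_q`, the row ordering `s * n_actions + a`, and the random toy MDP are assumptions made for illustration; they are not taken from the paper, whose exact formulation may differ.

```python
import numpy as np

def avi_linear_q(P, R, Phi, d, gamma, n_actions, iters=500):
    """Approximate value iteration for Q-functions with linear features.

    P:   (S*A, S) transition matrix; row s*A + a is the distribution over s'
    R:   (S*A,)   expected reward for each state-action pair
    Phi: (S*A, m) feature matrix with full column rank
    d:   (S*A,)   positive weights defining the projection
    """
    n_states = P.shape[1]
    D = np.diag(d)
    # Weighted least-squares coefficients: (Phi^T D Phi)^{-1} Phi^T D
    proj = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        q = (Phi @ theta).reshape(n_states, n_actions)
        # Bellman backup with a greedy (max) target
        target = R + gamma * P @ q.max(axis=1)
        # Project the backup onto the span of the features
        theta = proj @ target
    return theta

# Hypothetical toy MDP: 3 states, 2 actions, tabular (identity) features.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S * A, S))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(S * A)
Phi = np.eye(S * A)
d = np.full(S * A, 1.0 / (S * A))

theta = avi_linear_q(P, R, Phi, d, gamma, A)
backup = R + gamma * P @ (Phi @ theta).reshape(S, A).max(axis=1)
residual = np.max(np.abs(Phi @ theta - backup))
```

With identity features the projection is exact, so the iteration reduces to ordinary value iteration and `residual` is numerically zero. With a general `Phi` the iteration need not converge, which is why sufficient conditions such as those studied in the paper matter.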

Paper Structure

This paper contains 32 sections, 26 theorems, 66 equations, 4 figures, and 2 algorithms.

Key Result

Theorem 3.2

Figures (4)

  • Figure 1: The first figure and the last two figures show Examples \ref{ex:snrdd:local-minima} and \ref{ex:eps-unstable} in Appendix \ref{sec:mdp_examples}, respectively. In the last figure, "stable" and "unstable" refer to whether ${\bm{T}}({\bm{\theta}},\pi_{{\bm{\theta}}},\beta_{{\bm{\theta}}})$ is a Hurwitz matrix at each point.
  • Figure 2: The first two and the last two figures show experimental results on Examples \ref{ex:all-snrdd-but-not-avi-convergence} and \ref{ex:all-avi-but-hurwitz}, respectively. For reproducibility, the experiments use the expected-update version of Q-learning given in Algorithm \ref{algo:deterministic-q} in Appendix \ref{sec:algo}.
  • Figure 3: Experimental results on Example \ref{['ex:all-snrdd-but-not-avi-convergence']}.
  • Figure 4: Experimental results on Example \ref{['ex:all-avi-but-hurwitz']}.
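
The caption of Figure 2 refers to an expected-update version of Q-learning. The paper's Algorithm is not reproduced on this page, so the following is only a sketch of what such a deterministic update commonly looks like for linear Q-learning with a greedy target: the sampled temporal-difference step is replaced by its expectation under a weighting `d` over state-action pairs. The function name, step size, and toy MDP are illustrative assumptions.

```python
import numpy as np

def expected_linear_q_learning(P, R, Phi, d, gamma, n_actions,
                               alpha=0.5, iters=5000):
    """Deterministic (expected-update) linear Q-learning sketch.

    Each step applies theta <- theta + alpha * Phi^T D (T(Phi theta) - Phi theta),
    where T is the Bellman optimality backup and D = diag(d).
    """
    n_states = P.shape[1]
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        q = Phi @ theta
        v_next = q.reshape(n_states, n_actions).max(axis=1)
        # Expected temporal-difference error for every state-action pair
        td_error = R + gamma * P @ v_next - q
        theta = theta + alpha * Phi.T @ (d * td_error)
    return theta

# Hypothetical toy MDP: 3 states, 2 actions, tabular (identity) features.
rng = np.random.default_rng(1)
S, A, gamma = 3, 2, 0.9
P = rng.random((S * A, S))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(S * A)
Phi = np.eye(S * A)
d = np.full(S * A, 1.0 / (S * A))

theta = expected_linear_q_learning(P, R, Phi, d, gamma, A)
backup = R + gamma * P @ (Phi @ theta).reshape(S, A).max(axis=1)
residual = np.max(np.abs(Phi @ theta - backup))
```

Because no sampling is involved, repeated runs give identical trajectories, which is the reproducibility benefit the caption mentions. In this tabular toy case the update is a sup-norm contraction and `residual` goes to zero; for general features, convergence depends on conditions such as the SNRDD assumption.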

Theorems & Definitions (69)

  • Definition 3.1: molchanov1989criteria
  • Theorem 3.2
  • Remark 3.3
  • Remark 3.4
  • Remark 3.5
  • Lemma 3.6
  • Remark 3.7
  • Remark 3.8
  • Remark 3.9
  • Theorem 3.10
  • ...and 59 more