Parameterized Projected Bellman Operator

Théo Vincent, Alberto Maria Metelli, Boris Belousov, Jan Peters, Marcello Restelli, Carlo D'Eramo

TL;DR

This work proposes learning an approximate version of the Bellman operator instead of estimating it through samples as in AVI. The learned operator generalizes across transition samples and avoids the computationally intensive projection step.

Abstract

Approximate value iteration (AVI) is a family of algorithms for reinforcement learning (RL) that aims to obtain an approximation of the optimal value function. Generally, AVI algorithms implement an iterated procedure where each step consists of (i) an application of the Bellman operator and (ii) a projection step into a considered function space. Notoriously, the Bellman operator leverages transition samples, which strongly determine its behavior, as uninformative samples can result in negligible updates or long detours, whose detrimental effects are further exacerbated by the computationally intensive projection step. To address these issues, we propose a novel alternative approach based on learning an approximate version of the Bellman operator rather than estimating it through samples as in AVI approaches. This way, we are able to (i) generalize across transition samples and (ii) avoid the computationally intensive projection step. For this reason, we call our novel operator projected Bellman operator (PBO). We formulate an optimization problem to learn PBO for generic sequential decision-making problems, and we theoretically analyze its properties in two representative classes of RL problems. Furthermore, we theoretically study our approach under the lens of AVI and devise algorithmic implementations to learn PBO in offline and online settings by leveraging neural network parameterizations. Finally, we empirically showcase the benefits of PBO w.r.t. the regular Bellman operator on several RL problems.
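
The two-step AVI iteration described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy two-state problem, the tabular function class, and the names `bellman_targets`, `project_tabular`, and `avi` are all assumptions made for illustration. With a tabular function class, the least-squares projection reduces to averaging the targets observed for each state-action pair.

```python
from collections import defaultdict

GAMMA = 0.9          # discount factor (assumed value)
ACTIONS = (0, 1)     # hypothetical two-action problem


def bellman_targets(q, samples, gamma=GAMMA):
    """Step (i): apply the empirical Bellman operator to q on each sample."""
    return [
        (s, a, r + gamma * max(q[(s_next, b)] for b in ACTIONS))
        for (s, a, r, s_next) in samples
    ]


def project_tabular(targets, q_prev):
    """Step (ii): least-squares projection onto a tabular function class,
    which reduces to averaging the targets of each (state, action) pair."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s, a, y in targets:
        sums[(s, a)] += y
        counts[(s, a)] += 1
    q_new = dict(q_prev)  # pairs without samples keep their previous value
    for key in sums:
        q_new[key] = sums[key] / counts[key]
    return q_new


def avi(samples, states, n_iterations):
    """Iterate Bellman application followed by projection."""
    q = {(s, a): 0.0 for s in states for a in ACTIONS}
    for _ in range(n_iterations):
        q = project_tabular(bellman_targets(q, samples), q)
    return q


# Toy data (assumed): taking action 1 in state 1 gives reward 1 and stays there.
samples = [(0, 0, 0.0, 0), (0, 1, 0.0, 1), (1, 0, 0.0, 0), (1, 1, 1.0, 1)]
q = avi(samples, states=(0, 1), n_iterations=200)
```

Every iteration of this loop needs the transition samples again, both to build the targets and to fit them. That per-iteration dependence on samples and on the projection is what the learned operator is meant to remove.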

Paper Structure

This paper contains 28 sections, 4 theorems, 19 equations, 11 figures, 2 tables, 1 algorithm.

Key Result

Theorem 2

(See Theorem 3.4 of farahmand2011regularization) Let $K \in \mathbb{N}^*$ and let $\rho, \nu$ be two probability distributions over $\mathcal{S} \times \mathcal{A}$. For any sequence $(Q_k)_{k=0}^K \subset B\left(\mathcal{S} \times \mathcal{A}, R_{\gamma}\right)$, where $R_{\gamma}$ depends on the reward function [...] where $\alpha_k$ and $C_{K, \gamma, R_{\gamma}}$ do not depend on the sequence $(Q_k)_{k=0}^K$. $F$ [...]

Figures (11)

  • Figure 1: PBO $\Lambda$ (left) operates on value function parameters, as opposed to AVI (right), which uses the empirical Bellman operator $\Gamma$ followed by the projection operator $\Pi_{\text{proj}}$.
  • Figure 2: Behavior of our PBO and AVI in the parametric space of value functions $\mathcal{Q}$. Here $Q^*$, $Q_{{\bm{\omega}}^*}$, and $Q_{\Lambda^\infty}$ are, respectively, the optimal value function, its projection on the parametric space, and the fixed point of PBO. Contrary to the regular Bellman operator, PBO can be applied for an arbitrary number of steps (blue lines) without requiring additional samples.
  • Figure 3: Behavior of ProFQI and FQI in the function space $\mathcal{Q}_\Omega$ for one iteration. The ability to apply PBO an arbitrary number of times enables ProFQI to generate a sequence of action-value functions $Q_{\Lambda^k_\phi}({\bm{\omega}}_0)$ that can be used to enrich the loss function to learn PBO (see red lines). In contrast, one iteration of FQI corresponds to a single application of the Bellman operator followed by the projection step onto the function space.
  • Figure 4: $\ell_2$-norm of the difference between the optimal action-value function and the approximated action-value function on chain-walk. Here $K$ is the number of iterations included in the training loss of PBO, while $k$ is the number of PBO applications after training. Results are averaged over $20$ seeds. Unlike FQI, PBO allows $k \geq K$, which results in better convergence for increasing $k$ for each fixed $K \in \{2, 5, 15\}$.
  • Figure 5: Test task: chain-walk from munos2003error. The reward is $0$ in all states except green states where it equals $1$.
  • ...and 6 more figures
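
The captions above describe PBO as a map on value-function parameters that can be applied an arbitrary number of times without additional samples and that has a fixed point. The sketch below illustrates only that usage pattern. The affine form of the operator, the matrix `A`, the vector `B`, and the helper names are hypothetical stand-ins for a trained operator, not values or code from the paper.

```python
def make_affine_operator(a_matrix, b_vector):
    """Build a hypothetical operator on parameters: omega -> A @ omega + b."""
    def operator(omega):
        return [
            sum(a_ij * w_j for a_ij, w_j in zip(row, omega)) + b_i
            for row, b_i in zip(a_matrix, b_vector)
        ]
    return operator


def apply_k_times(operator, omega, k):
    """Apply the operator k times; no transition samples are involved."""
    for _ in range(k):
        omega = operator(omega)
    return omega


# Assumed stand-in for a learned operator (a contraction, so iterates converge).
A = [[0.5, 0.1], [0.0, 0.8]]
B = [1.0, 0.2]
pbo = make_affine_operator(A, B)

omega_0 = [0.0, 0.0]
omega_2 = apply_k_times(pbo, omega_0, 2)      # few applications
omega_200 = apply_k_times(pbo, omega_0, 200)  # many applications, same cost model
```

For this particular choice of `A` and `B`, the iterates approach the solution of $\omega = A\omega + b$, which is $(2.2, 1.0)$. Whether a learned operator is a contraction depends on how it is parameterized and trained, so convergence here should be read as a property of the toy example only.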

Theorems & Definitions (9)

  • Definition 1
  • Theorem 2
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • proof
  • proof
  • proof
  • proof