Fitted Q-Iteration via Max-Plus-Linear Approximation

Y. Liu, M. A. S. Kolarijani

TL;DR

This work advances offline reinforcement learning by introducing max-plus-linear approximators for the Q-function within fitted Q-iteration. It proposes MP-FQI and a variational variant (v-MP-FQI) that leverage the Bellman operator’s compatibility with max-plus algebra to achieve provable linear convergence, with per-iteration complexities that scale favorably either with the number of samples ($\mathcal{O}(np)$) or with the number of test functions ($\mathcal{O}(pq)$). The algorithms exploit MP-regression structure to reduce updates to max-plus matrix-vector operations, and the variational formulation achieves sample-size independence in per-iteration costs. Numerical experiments on a DC motor control problem show improved greedy policies over standard FQI, illustrating practical benefits for offline RL with MP representations. The paper also discusses extensions, including sparse MP solutions and fast transforms, to broaden applicability and efficiency.

Abstract

In this study, we consider the application of max-plus-linear approximators for the Q-function in offline reinforcement learning of discounted Markov decision processes. In particular, we incorporate these approximators to propose novel fitted Q-iteration (FQI) algorithms with provable convergence. Exploiting the compatibility of the Bellman operator with max-plus operations, we show that the max-plus-linear regression within each iteration of the proposed FQI algorithm reduces to simple max-plus matrix-vector multiplications. We also consider the variational implementation of the proposed algorithm, which leads to a per-iteration complexity that is independent of the number of samples.
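To make the phrase "max-plus matrix-vector multiplications" concrete, here is a minimal NumPy sketch of the two standard operations from max-plus algebra: the max-plus product and its residuation (the greatest vector whose max-plus image stays below a target). The function names `mp_matvec` and `mp_residual`, the toy matrix, and the restriction to finite entries are assumptions made for illustration; this is not the paper's MP-FQI update, only the kind of primitive that such an update would be built from.

```python
import numpy as np

def mp_matvec(A, x):
    """Max-plus product: (A (x) x)_i = max_j (A[i, j] + x[j])."""
    A = np.asarray(A, dtype=float)
    x = np.asarray(x, dtype=float)
    # Broadcast x across the rows of A, then take the row-wise maximum.
    return np.max(A + x[None, :], axis=1)

def mp_residual(A, y):
    """Greatest x with mp_matvec(A, x) <= y: x_j = min_i (y[i] - A[i, j]).

    Assumes all entries are finite (hypothetical simplification).
    """
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float)
    # Broadcast y across the columns of A, then take the column-wise minimum.
    return np.min(y[:, None] - A, axis=0)

# Toy data (hypothetical): 2 "samples" by 2 "basis functions".
A = np.array([[0.0, 1.0],
              [2.0, -1.0]])
x = np.array([3.0, 1.0])

y = mp_matvec(A, x)          # max-plus image of x
x_hat = mp_residual(A, y)    # greatest sub-solution of A (x) x <= y
```

For an $n \times p$ matrix, each call touches every entry once, so the cost is $\mathcal{O}(np)$, which matches the sample-dependent per-iteration complexity quoted in the TL;DR.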

Paper Structure (31 sections, 11 theorems, 68 equations, 1 figure, 2 algorithms)

Key Result

Lemma 2.1

Consider the two functions $f,\tilde{f} \in \underline {\mathbb{R}}^{\mathsf{Z}}$ and the scalar $\alpha \in \underline {\mathbb{R}}$. Define $[\max \{ f, \tilde{f} \}](z) = \max\{f(z),\tilde{f}(z)\}$ and $[\alpha + f](z) = \alpha + f(z)$ for all $z\in\mathsf{Z}$. We have

Figures (1)

  • Figure 1: DC motor stabilization problem. Top-Left: Average reward of 100 instances of the problem with random initial state over $T=100$ time steps. Solid (resp. dashed) lines correspond to quadratic (resp. distance) functions in (v-)MP-FQI and RBF (resp. indicator) functions in FQI for state features. Top-Right: The running time of the algorithms. Solid (resp. dashed) lines correspond to compilation (resp. per-iteration) time. Bottom: Convergence of the algorithms. Solid and dashed lines are the same as in the top-left panel.

Theorems & Definitions (13)

  • Lemma 2.1: MP additivity and homogeneity of $\mathbf{B}_{\mathrm{s}}$
  • Lemma 2.2: Non-expansiveness of MP-linear operators
  • Proposition 3.1: MP empirical BE
  • Proposition 3.3: MP-FQI regression
  • Theorem 3.4: Convergence of MP-FQI
  • Theorem 3.5: Complexity of MP-FQI
  • Remark 3.6: Comparison with standard FQI
  • Proposition 3.7: MP empirical BE II
  • Proposition 4.1: MP empirical variational BE
  • Lemma 4.3
  • ...and 3 more