Gaussian-Mixture-Model Q-Functions for Reinforcement Learning by Riemannian Optimization

Minh Vu, Konstantinos Slavakis

TL;DR

Numerical tests show that, with no use of experienced data, the proposed design outperforms state-of-the-art methods, even deep Q-networks which use experienced data, on benchmark RL tasks.

Abstract

This paper establishes a novel role for Gaussian-mixture models (GMMs) as functional approximators of Q-function losses in reinforcement learning (RL). Unlike the existing RL literature, where GMMs play their typical role as estimates of probability density functions, here GMMs approximate Q-function losses. The new Q-function approximators, coined GMM-QFs, are incorporated in Bellman residuals to formulate a Riemannian-optimization task as a novel policy-evaluation step in standard policy-iteration schemes. The paper demonstrates how the hyperparameters (means and covariance matrices) of the Gaussian kernels are learned from the data, thus opening the door of RL to the powerful toolbox of Riemannian optimization. Numerical tests show that, with no use of experienced data, the proposed design outperforms state-of-the-art methods, even deep Q-networks which use experienced data, on benchmark RL tasks.
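To make the idea of a GMM-QF concrete, the following is a minimal sketch, not the authors' implementation. It assumes the Q-function is a weighted sum of unnormalized Gaussian kernels over a state-action vector `z`, with real-valued weights `xi`, means `means`, and positive-definite covariance matrices `covs`, and it assumes the standard Bellman residual `g + alpha * Q(z_next) - Q(z)` with one-step loss `g` and discount factor `alpha`. The function names, the kernel normalization, and the squared-residual loss are illustrative assumptions.

```python
import numpy as np

def gmm_qf(z, xi, means, covs):
    """Evaluate Q(z) = sum_k xi[k] * exp(-0.5 * (z - m_k)^T C_k^{-1} (z - m_k)).

    Assumed form: unnormalized Gaussian kernels with real-valued weights.
    """
    z = np.asarray(z, dtype=float)
    q = 0.0
    for w, m, C in zip(xi, means, covs):
        d = z - np.asarray(m, dtype=float)
        # Solve C x = d instead of forming the inverse explicitly.
        q += w * np.exp(-0.5 * d @ np.linalg.solve(C, d))
    return float(q)

def bellman_residual_loss(batch, xi, means, covs, alpha=0.9):
    """Mean squared Bellman residual over samples (z, g, z_next).

    Hypothetical helper: z_next is the next state-action vector
    under the current policy, g is the observed one-step loss.
    """
    residuals = [
        g + alpha * gmm_qf(z_next, xi, means, covs) - gmm_qf(z, xi, means, covs)
        for z, g, z_next in batch
    ]
    return float(np.mean(np.square(residuals)))
```

In a policy-evaluation step of the kind the abstract describes, a loss like this would be minimized over `xi`, `means`, and `covs`, with each covariance matrix kept positive definite; that constraint is where Riemannian optimization comes in.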

Paper Structure

This paper contains 7 sections, 1 theorem, 8 equations, 3 figures, 2 algorithms.

Key Result

Proposition 1

Consider a point $\bm{\Omega}^{(j)} \coloneqq ( \bm{\xi}^{(j)}, \mathbf{m}_1^{(j)}, \dots, \mathbf{m}_K^{(j)}, \mathbf{C}_1^{(j)}, \dots, \mathbf{C}_K^{(j)}) \in \mathscr{M}$ (see algo:armijo), and its associated GMM-QF $Q^{(j)}$. Let also $\delta_t \coloneqq g_t + \alpha Q^{(j)} (\mathbf{z}_t^\prim

Figures (3)

  • Figure 1: Inverted-pendulum dataset. Curves shown: Algorithm \ref{algo:PI} with $K=5$, KLSPI [xu07klspi], OBR [onlineBRloss:16], DQN [mnih13dqn], EM-GMMRL [agostini17gmmrl].
  • Figure 2: Mountain-car dataset. Curves shown: Algorithm \ref{algo:PI} with $K=500$; the other curves follow Figure 1.
  • Figure 3: Effect of different $K$ in Algorithm \ref{algo:PI} for the setting of \ref{fig:mountaincar-discrete}. Curves shown: $K=20$, $K=50$, $K=200$; the curves for $K=5$ and $K=500$ follow those of Figures 1 and 2. The larger the $K$, the richer the hyperparameter space $\mathscr{M}$ and the faster the agent learns through the feedback from the environment, at the expense of increased computational complexity.
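
The trade-off noted in the Figure 3 caption can be illustrated by counting free parameters. The sketch below assumes that a point of $\mathscr{M}$ consists of $K$ weights, $K$ mean vectors of dimension $D$, and $K$ symmetric positive-definite $D \times D$ covariance matrices, as the listing of $\bm{\Omega}^{(j)}$ in Proposition 1 suggests. The function name and the dimension `D` of the state-action vector are hypothetical and not taken from the paper.

```python
def gmm_qf_num_params(K, D):
    """Count free parameters of a K-kernel GMM-QF over D-dimensional inputs.

    Assumed breakdown: K weights, K*D mean entries, and D*(D+1)/2
    free entries per symmetric covariance matrix.
    """
    weights = K
    mean_entries = K * D
    cov_entries = K * D * (D + 1) // 2
    return weights + mean_entries + cov_entries
```

Under these assumptions the count grows linearly in $K$ and quadratically in $D$, which is consistent with the caption's remark that a larger $K$ buys a richer parameter space at a higher computational cost.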

Theorems & Definitions (1)

  • Proposition 1: Computing gradients