Gaussian-Mixture-Model Q-Functions for Policy Iteration in Reinforcement Learning

Minh Vu, Konstantinos Slavakis

TL;DR

This work introduces Gaussian Mixture Q-Functions (GMM-QFs) as parametric surrogates for Q-functions in reinforcement learning, optimized via Riemannian-manifold techniques within a policy-iteration framework. The authors prove that GMM-QFs are universal approximators and integrate Bellman-residual losses into a BR-based PI scheme, achieving competitive performance without experience data and with a smaller parameter footprint than deep networks. The theoretical contributions include contraction properties, error bounds, and convergence results, while numerical tests on standard control tasks demonstrate strong performance and computational efficiency. The approach offers a principled, geometry-informed alternative to nonparametric and deep RL methods, with potential for online deployment and automatic model-size control through sparsification.

Abstract

Unlike their conventional use as estimators of probability density functions in reinforcement learning (RL), this paper introduces a novel function-approximation role for Gaussian mixture models (GMMs) as direct surrogates for Q-function losses. These parametric models, termed GMM-QFs, possess substantial representational capacity, as they are shown to be universal approximators over a broad class of functions. They are further embedded within Bellman residuals, where their learnable parameters -- a fixed number of mixing weights, together with Gaussian mean vectors and covariance matrices -- are inferred from data via optimization on a Riemannian manifold. This geometric perspective on the parameter space naturally incorporates Riemannian optimization into the policy-evaluation step of standard policy-iteration frameworks. Rigorous theoretical results are established, and supporting numerical tests show that, even without access to experience data, GMM-QFs deliver competitive performance and, in some cases, outperform state-of-the-art approaches across a range of benchmark RL tasks, all while maintaining a significantly smaller computational footprint than deep-learning methods that rely on experience data.
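To make the parametrization described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a GMM-QF of the form $Q(\mathbf{z}) = \sum_{k=1}^{K} \xi_k \exp[ -(\mathbf{z} - \mathbf{m}_k)^{\top} \mathbf{C}_k^{-1} (\mathbf{z} - \mathbf{m}_k) ]$, where $\mathbf{z}$ stacks a state-action pair, and a mean-squared empirical Bellman residual over sampled transitions. The unnormalized Gaussian kernel, the discount factor `alpha`, and the names `gmm_qf` and `bellman_residual_loss` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def gmm_qf(z, weights, means, covs):
    """Evaluate Q(z) = sum_k weights[k] * exp(-(z - m_k)^T C_k^{-1} (z - m_k)).

    Assumed form (hypothetical): unnormalized Gaussian kernels, real weights.
    z: (D,) or (N, D); weights: (K,); means: (K, D); covs: (K, D, D), each SPD.
    """
    z = np.atleast_2d(z)                        # (N, D)
    q = np.zeros(z.shape[0])
    for xi, m, C in zip(weights, means, covs):
        diff = z - m                            # (N, D)
        sol = np.linalg.solve(C, diff.T).T      # rows hold C^{-1} (z - m)
        q += xi * np.exp(-np.sum(diff * sol, axis=1))
    return q

def bellman_residual_loss(params, z, g, z_next, alpha=0.9):
    """Mean squared empirical Bellman residual over sampled transitions.

    z stacks (s, a); z_next stacks (s', mu(s')); g holds the one-step losses.
    alpha is an assumed discount factor.
    """
    weights, means, covs = params
    residual = (g + alpha * gmm_qf(z_next, weights, means, covs)
                - gmm_qf(z, weights, means, covs))
    return float(np.mean(residual ** 2))
```

In this sketch the learnable parameters are exactly the triple `(weights, means, covs)`. The covariance matrices must remain symmetric positive definite, which is the constraint that motivates treating the parameter space as a manifold.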

Paper Structure

This paper contains 29 sections, 10 theorems, 69 equations, 9 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

Consider $\{ \mathbf{C}_k \}_{k=1}^K \subset \mathbb{S}_{++}^{D_z}$, with $\mathbf{C}_k \neq \mathbf{C}_{k'}$, $\forall k\neq k'$. Let $\sum\nolimits_{k=1}^{K} \beta_k Q_k = 0$, for some $\{ \beta_k \}_{k=1}^K \subset \mathbb{R}$, where $Q_k \in \mathscr{H}_{\mathbf{C}_k}^{\text{pre}} \setminus \{ 0 \}$ …

Figures (9)

  • Figure 1: RL as a sequential decision-making process: at state $\mathbf{s}$, the RL agent takes decision/action $\mathbf{a} \coloneqq \mu( \mathbf{s} )$, suffers the one-step loss $g(\mathbf{s}, \mathbf{a})$, and moves to the next state $\mathbf{s}^{\prime}$ according to some transition probability. Function $\mu(\cdot)$ denotes the policy or decision-making mechanism. The agent seeks to identify a policy that minimizes the cumulative (long-term) loss---quantified by the Q-function $Q(\cdot)$---incurred over its sequence of actions.
  • Figure 2: Policy iteration consists of two steps: policy evaluation and policy improvement. The "exploration" dataset $\mathcal{D}_{\mu_n}[T]$ is collected on the fly under the current policy $\mu_n$ and is distinct from experience data, which are gathered under previous policies and stored in a replay buffer. The proposed framework (\ref{algo:PI}) relies exclusively on $\mathcal{D}_{\mu_n}[T]$ and does not use any experience data or a replay buffer.
  • Figure 3: Gradient of the loss $\hat{\mathcal{L}}_{\mu}[T]( \cdot )$ at $\bm{\Omega}^{(j)}$, with $t_j^{\textnormal{A}}$ being the (Armijo) step-size. In general, the gradient is first projected onto the tangent space $T_{ \bm{\Omega}^{(j)} } \mathfrak{M}_K$ and then retracted back to the manifold $\mathfrak{M}_K$. In the present case, however, this projection is unnecessary because, as shown in \ref{prop:gradients}, the computed gradient already lies in the tangent space.
  • Figure 4: Control tasks considered in \ref{sec:tests}. The dynamics of the systems given above are only used for simulation.
  • Figure 5: Inverted-pendulum dataset. Curves compare \ref{algo:PI} (AffInv) with $K=5$, KLSPI \cite{xu07klspi}, OBR \cite{onlineBRloss:16}, DQN \cite{mnih13dqn}, Dueling DDQN \cite{duelingddqn}, PPO \cite{PPO}, and EM-GMMRL \cite{agostini17gmmrl}.
  • ...and 4 more figures
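The update pictured in Figure 3 (a gradient step with an Armijo step-size, followed by a return to the manifold) can be sketched for a single covariance matrix as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: it treats only the symmetric-positive-definite (SPD) factor of the parameter manifold, uses the affine-invariant metric (suggested by the "AffInv" label in Figure 5), and uses the exponential map as the retraction. The function names and the default constants `t0`, `beta`, and `sigma` are hypothetical.

```python
import numpy as np

def sym(A):
    """Symmetric part of a square matrix."""
    return 0.5 * (A + A.T)

def _sym_fun(S, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(S)
    return (V * fun(w)) @ V.T

def exp_map(C, xi):
    """Affine-invariant exponential map at SPD C along a symmetric tangent xi."""
    C_half = _sym_fun(C, np.sqrt)
    C_ihalf = _sym_fun(C, lambda w: 1.0 / np.sqrt(w))
    inner = _sym_fun(sym(C_ihalf @ xi @ C_ihalf), np.exp)
    return sym(C_half @ inner @ C_half)

def armijo_step(f, egrad, C, t0=1.0, beta=0.5, sigma=1e-4, max_backtracks=30):
    """One Riemannian steepest-descent step with Armijo backtracking.

    f: loss on SPD matrices; egrad: its Euclidean gradient.
    Returns the new point and the accepted step-size (0.0 if none accepted).
    """
    G = sym(egrad(C))
    rgrad = C @ G @ C                       # Riemannian gradient (affine-invariant)
    sq_norm = np.trace(G @ C @ G @ C)       # <rgrad, rgrad>_C = tr(G C G C)
    t, f0 = t0, f(C)
    for _ in range(max_backtracks):
        C_new = exp_map(C, -t * rgrad)
        if f(C_new) <= f0 - sigma * t * sq_norm:   # sufficient-decrease test
            return C_new, t
        t *= beta                           # shrink the step and retry
    return C, 0.0
```

Because the exponential map keeps every iterate SPD by construction, no eigenvalue clipping or projection onto the cone is needed after the step.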

Theorems & Definitions (22)

  • Theorem 1
  • proof
  • Theorem 3: Universal-approximation properties of GMM-QFs
  • proof
  • Proposition 4
  • proof
  • Example 5 \cite{Absil:OptimManifolds:08}
  • Example 6 \cite{RobbinSalamon:22}
  • Proposition 7: Computing gradients
  • proof
  • ...and 12 more