Gaussian-Mixture-Model Q-Functions for Policy Iteration in Reinforcement Learning
Minh Vu, Konstantinos Slavakis
TL;DR
This work introduces Gaussian-mixture-model Q-functions (GMM-QFs) as parametric surrogates for Q-functions in reinforcement learning, optimized via Riemannian-manifold techniques within a policy-iteration (PI) framework. The authors prove that GMM-QFs are universal approximators and embed them in Bellman-residual (BR) losses, yielding a BR-based PI scheme that achieves competitive performance without experience data and with a smaller parameter footprint than deep networks. The theoretical contributions include contraction properties, error bounds, and convergence results, while numerical tests on standard control tasks demonstrate strong performance and computational efficiency. The approach offers a principled, geometry-informed alternative to nonparametric and deep RL methods, with potential for online deployment and automatic model-size control through sparsification.
Abstract
Departing from the conventional use of Gaussian mixture models (GMMs) as estimators of probability density functions in reinforcement learning (RL), this paper introduces a novel function-approximation role for GMMs as direct surrogates for Q-function losses. These parametric models, termed GMM-QFs, possess substantial representational capacity, as they are shown to be universal approximators over a broad class of functions. They are further embedded within Bellman residuals, where their learnable parameters -- a fixed number of mixing weights, together with Gaussian mean vectors and covariance matrices -- are inferred from data via optimization on a Riemannian manifold. This geometric perspective on the parameter space naturally incorporates Riemannian optimization into the policy-evaluation step of standard policy-iteration frameworks. Rigorous theoretical results are established, and supporting numerical tests show that, even without access to experience data, GMM-QFs deliver competitive performance and, in some cases, outperform state-of-the-art approaches across a range of benchmark RL tasks, all while maintaining a significantly smaller computational footprint than deep-learning methods that rely on experience data.
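
To make the parametrization concrete, the following is a minimal NumPy sketch of how a GMM-style Q-function and an empirical squared Bellman residual might be evaluated for a fixed policy. It is an illustration under stated assumptions, not the authors' implementation: the unnormalized Gaussian kernel, the unconstrained real-valued weights, the one-step `cost` signal, the discount value, and the names `gmm_qf` and `bellman_residual_loss` are all hypothetical choices. The Riemannian optimization over the parameters is not shown.

```python
import numpy as np


def gmm_qf(z, weights, means, covs):
    """Evaluate Q(z) = sum_k weights[k] * exp(-0.5 * (z - m_k)^T C_k^{-1} (z - m_k)).

    Assumed shapes (hypothetical convention):
      z:       (N, D) stacked state-action vectors
      weights: (K,)   real mixing weights, not constrained to sum to one
      means:   (K, D) Gaussian mean vectors
      covs:    (K, D, D) symmetric positive-definite covariance matrices
    """
    z = np.atleast_2d(z)
    diff = z[:, None, :] - means[None, :, :]                # (N, K, D)
    prec = np.linalg.inv(covs)                              # (K, D, D)
    # Squared Mahalanobis distance of every point to every component.
    maha = np.einsum("nkd,kde,nke->nk", diff, prec, diff)   # (N, K)
    return np.exp(-0.5 * maha) @ weights                    # (N,)


def bellman_residual_loss(z, cost, z_next, weights, means, covs, discount=0.9):
    """Mean squared Bellman residual for a fixed policy.

    z_next holds the next state paired with the action the policy would take
    there, so the target is cost + discount * Q(z_next).
    """
    target = cost + discount * gmm_qf(z_next, weights, means, covs)
    residual = gmm_qf(z, weights, means, covs) - target
    return np.mean(residual ** 2)


# Usage with K = 2 components on D = 3 dimensional state-action vectors.
rng = np.random.default_rng(0)
weights = np.array([1.0, -0.5])
means = rng.normal(size=(2, 3))
covs = np.stack([np.eye(3), 2.0 * np.eye(3)])
z, z_next = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
cost = rng.normal(size=5)
loss = bellman_residual_loss(z, cost, z_next, weights, means, covs)
```

In this sketch the covariance matrices are the only parameters with a non-Euclidean constraint (positive definiteness), which is what motivates treating the parameter space as a Riemannian manifold rather than running plain gradient descent on the raw matrix entries.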
