Q-STAC: Q-Guided Stein Variational Model Predictive Actor-Critic
Shizhe Cai, Zeya Yin, Jayadeep Jacob, Fabio Ramos
TL;DR
Q-STAC targets low sample efficiency and training instability in robotic reinforcement learning by unifying Soft Actor-Critic (SAC) with Bayesian Model Predictive Control (MPC). It uses Stein Variational Gradient Descent (SVGD) to refine action-trajectory particles sampled from a learned prior, with soft Q-values from short-horizon rollouts guiding the refinement. This trajectory-level Bayesian inference reduces long-horizon model bias and eliminates the need for hand-crafted costs. Experiments on 2D navigation, Kinova manipulation, and real-world fruit-picking show better performance, robustness, and data efficiency than both model-free and model-based baselines.
Abstract
Deep reinforcement learning (DRL) often struggles with complex robotic manipulation tasks due to low sample efficiency and biased value estimation. Model-based reinforcement learning (MBRL) improves efficiency by leveraging environment dynamics, and prior work has integrated Model Predictive Control (MPC) to enhance policy robustness through online trajectory optimization. However, existing MBRL approaches still suffer from high model bias, task-specific cost-function design, and significant computational overhead. To address these challenges, we propose Q-guided Stein Variational Model Predictive Actor-Critic (Q-STAC), a unified framework that bridges Bayesian MPC and Soft Actor-Critic (SAC). Q-STAC employs Stein Variational Gradient Descent (SVGD) to iteratively optimize action sequences sampled from a learned prior distribution, with Q-values guiding the optimization, thereby eliminating manual cost-function engineering. By performing short-horizon model-predictive rollouts, Q-STAC reduces cumulative prediction errors, improves training stability, and reduces computational complexity. Experiments on simulated particle navigation, diverse robotic manipulation tasks, and a real-world fruit-picking scenario demonstrate that Q-STAC consistently achieves superior sample efficiency, stability, and overall performance compared to both model-free and model-based baselines.
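
To make the core idea concrete, the following is a minimal NumPy sketch of SVGD refining a set of action-sequence particles toward high-value regions. It is an illustration under stated assumptions, not the Q-STAC implementation: the quadratic `toy_score` stands in for the gradient of a learned soft Q-value divided by a temperature `ALPHA`, the particles are drawn from a standard normal in place of the learned prior, no dynamics model or rollout is involved, and the names `rbf_kernel`, `svgd_step`, `toy_score`, and `best_sequence` are hypothetical.

```python
import numpy as np

def rbf_kernel(particles):
    """Return the RBF kernel matrix and its gradient w.r.t. the first argument."""
    diffs = particles[:, None, :] - particles[None, :, :]  # diffs[j, i] = x_j - x_i
    sq_dists = np.sum(diffs ** 2, axis=-1)
    n = particles.shape[0]
    # Median heuristic for the bandwidth (a common SVGD default, assumed here).
    bandwidth = np.median(sq_dists) / np.log(n + 1) + 1e-8
    K = np.exp(-sq_dists / bandwidth)
    # grad_K[j, i] is the gradient of k(x_j, x_i) with respect to x_j.
    grad_K = (-2.0 / bandwidth) * diffs * K[:, :, None]
    return K, grad_K

def svgd_step(particles, score_fn, step_size=0.1):
    """One SVGD update: kernel-weighted attraction plus pairwise repulsion."""
    n = particles.shape[0]
    scores = score_fn(particles)  # gradient of the target log-density per particle
    K, grad_K = rbf_kernel(particles)
    phi = (K.T @ scores + grad_K.sum(axis=0)) / n
    return particles + step_size * phi

# Toy setup (all values are assumptions chosen for illustration).
HORIZON, ACTION_DIM, N_PARTICLES = 3, 2, 32
ALPHA = 0.5  # temperature scaling the stand-in Q-value
rng = np.random.default_rng(0)
best_sequence = np.ones(HORIZON * ACTION_DIM)  # maximizer of the stand-in Q

def toy_score(particles):
    # Gradient of Q(U) / ALPHA for the stand-in Q(U) = -0.5 * ||U - best_sequence||^2.
    # The prior's log-density gradient is omitted to keep the sketch short.
    return -(particles - best_sequence) / ALPHA

# Each particle is a flattened action sequence of length HORIZON * ACTION_DIM.
particles = rng.standard_normal((N_PARTICLES, HORIZON * ACTION_DIM))
init_err = np.linalg.norm(particles.mean(axis=0) - best_sequence)

for _ in range(500):
    particles = svgd_step(particles, toy_score)

final_err = np.linalg.norm(particles.mean(axis=0) - best_sequence)
spread = particles.std(axis=0).min()  # stays positive because of the repulsive term
```

In this sketch the first term of `phi` pulls particles toward sequences with a higher stand-in Q-value, while the kernel-gradient term pushes them apart so the set keeps covering several candidate trajectories instead of collapsing to a single one. In a full system, the score would instead come from differentiating a critic evaluated on short-horizon model rollouts.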
