Q-STAC: Q-Guided Stein Variational Model Predictive Actor-Critic

Shizhe Cai, Zeya Yin, Jayadeep Jacob, Fabio Ramos

TL;DR

Q-STAC addresses sample efficiency and stability in robotic reinforcement learning by unifying SAC with Bayesian MPC. It uses Stein Variational Gradient Descent to refine action-trajectory particles sampled from a learned prior, guided by soft Q-values from short-horizon rollouts. This trajectory-level Bayesian inference reduces long-horizon model bias and eliminates the need for hand-crafted costs. Empirical results across 2D navigation, Kinova manipulation, and real-world fruit-picking demonstrate superior performance, robustness, and data efficiency over both model-free and model-based baselines.

Abstract

Deep reinforcement learning (DRL) often struggles with complex robotic manipulation tasks due to low sample efficiency and biased value estimation. Model-based reinforcement learning (MBRL) improves efficiency by leveraging environment dynamics, with prior work integrating Model Predictive Control (MPC) to enhance policy robustness through online trajectory optimization. However, existing MBRL approaches still suffer from high model bias, task-specific cost function design, and significant computational overhead. To address these challenges, we propose Q-guided Stein Variational Model Predictive Actor-Critic (Q-STAC), a unified framework that bridges Bayesian MPC and Soft Actor-Critic (SAC). Q-STAC employs Stein Variational Gradient Descent (SVGD) to iteratively optimize action sequences sampled from a learned prior distribution guided by Q-values, thereby eliminating manual cost-function engineering. By performing short-horizon model-predictive rollouts, Q-STAC reduces cumulative prediction errors, improves training stability, and reduces computational complexity. Experiments on simulated particle navigation, diverse robotic manipulation tasks, and a real-world fruit-picking scenario demonstrate that Q-STAC consistently achieves superior sample efficiency, stability, and overall performance compared to both model-free and model-based baselines.
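
To make the SVGD refinement step described above more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the target density over an action sequence is proportional to exp(Q/alpha), and it replaces the learned soft Q-function and model rollouts with a hypothetical quadratic `toy_q_grad` that peaks at a fixed `target` sequence. The prior is a plain standard Gaussian rather than a learned policy. All function names, the RBF kernel, and the bandwidth and step-size values are illustrative assumptions.

```python
import math
import random


def sample_prior(num_particles, horizon):
    """Draw action-sequence particles from a stand-in prior (standard Gaussian)."""
    return [[random.gauss(0.0, 1.0) for _ in range(horizon)]
            for _ in range(num_particles)]


def toy_q_grad(traj, target, alpha):
    """Gradient of Q(traj) / alpha for the toy Q(traj) = -sum((a - g)^2).

    Hypothetical stand-in for the gradient of a learned soft Q-value.
    """
    return [-2.0 * (a - g) / alpha for a, g in zip(traj, target)]


def svgd_step(particles, target, alpha=1.0, bandwidth=1.0, step=0.1):
    """One SVGD update with an RBF kernel k(x, y) = exp(-||x - y||^2 / bandwidth)."""
    n = len(particles)
    dim = len(particles[0])
    grads = [toy_q_grad(p, target, alpha) for p in particles]
    updated = []
    for xi in particles:
        phi = [0.0] * dim
        for xj, gj in zip(particles, grads):
            sq_dist = sum((a - b) ** 2 for a, b in zip(xj, xi))
            k = math.exp(-sq_dist / bandwidth)
            for d in range(dim):
                # Attraction: kernel-weighted gradient of the log target density.
                phi[d] += k * gj[d]
                # Repulsion: gradient of the kernel w.r.t. xj, keeps particles apart.
                phi[d] += (2.0 / bandwidth) * (xi[d] - xj[d]) * k
        updated.append([x + step * f / n for x, f in zip(xi, phi)])
    return updated


def refine(particles, target, iters=200, **kwargs):
    """Apply several SVGD steps to the particle set."""
    for _ in range(iters):
        particles = svgd_step(particles, target, **kwargs)
    return particles


def mean_sq_dist(particles, target):
    """Average squared distance from each particle to the toy Q maximiser."""
    total = sum(sum((a - g) ** 2 for a, g in zip(p, target)) for p in particles)
    return total / len(particles)
```

For example, calling `refine(sample_prior(8, 3), [2.0, -2.0, 1.0])` moves the eight particles toward the high-Q region while the repulsive kernel term keeps them spread out instead of collapsing onto a single sequence. In a setting like the one the abstract describes, the gradient would instead come from a learned Q-function evaluated along short-horizon rollouts.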

Paper Structure

This paper contains 35 sections, 14 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of Q-STAC, which integrates a learned prior policy, short-horizon model-predictive rollouts, and SVGD-based trajectory refinement under Q-value guidance. The framework unifies model-free policy learning and model-predictive optimization, enabling efficient action generation without hand-crafted cost functions.
  • Figure 2: Benchmark tasks used for algorithm evaluation: (a–c) 2D navigation tasks with multiple Gaussian obstacles, ranging in difficulty from easy to hard; (d–e) Kinova arm reaching tasks under two settings, without and with fixed obstacles; (f) Kinova pick-and-place task, demonstrating robotic manipulation involving picking and reaching motions; (g) real-world Kinova arm setup.
  • Figure 3: Performance comparison of reinforcement learning algorithms across multiple control tasks. The figure displays training curves for five algorithms (SAC, S2AC, MBPO, PETS, and Q-STAC) evaluated on the 2D navigation task suite and the Kinova manipulation task suite. Each plot shows normalized cumulative reward (y-axis) against environment steps (x-axis), measured in millions. Solid lines represent mean performance and shaded regions indicate standard deviation across 5 seeds.
  • Figure 4: Ablation results evaluating the effectiveness of Soft Q-Guidance on (a) 2D Navigation (Hard) and (b) Reach (Obstacled). The plots show episodic reward (y-axis) against environment steps (x-axis), measured in millions.
  • Figure 5: Ablation results for (a) prior selection and horizon length on the Reach task with Q-STAC and (b) analytical vs. learned dynamics on 2D Navigation (Hard) and Reach (Obstacles) in model-based RL.