Augmented Bayesian Policy Search

Mahdi Kallel, Debabrota Basu, Riad Akrour, Carlo D'Eramo

TL;DR

This work addresses the challenge of efficiently exploring high-dimensional policy spaces with deterministic policies by integrating reinforcement learning signals into Bayesian Optimization. It introduces the Advantage Mean Function, which injects action-value information via the performance-difference lemma into the GP prior, enabling the posterior gradient to align with the deterministic policy gradient. An adaptive ensemble of $Q$-function estimators is proposed to evaluate and weight critics, managed through a Follow The Regularised Leader scheme, and incorporated into the ABS algorithm within the MPD framework. The approach is validated on MuJoCo locomotion tasks, showing competitive or superior performance to existing direct policy search methods, especially in high-dimensional settings, and offering a scalable, sample-efficient alternative that bridges BO and RL.
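To make the critic-weighting idea concrete, the following is a minimal sketch of a Follow The Regularised Leader update over an ensemble of $Q$-function estimators. It is not the paper's implementation. It assumes a negative-entropy regulariser over the probability simplex, for which FTRL reduces to the exponential-weights (Hedge) rule. The per-round losses, the learning rate `eta`, and the function names `ftrl_weights` and `aggregate_q` are hypothetical and chosen for illustration only.

```python
import math


def ftrl_weights(cumulative_losses, eta=1.0):
    """FTRL on the simplex with a negative-entropy regulariser (assumed here).

    The closed-form minimiser is the exponential-weights rule:
    w_i is proportional to exp(-eta * L_i), with L_i the cumulative loss of critic i.
    """
    # Shift by the smallest loss for numerical stability; the
    # normalised weights are unchanged by this shift.
    base = min(cumulative_losses)
    unnormalised = [math.exp(-eta * (loss - base)) for loss in cumulative_losses]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def aggregate_q(q_values, weights):
    """Weighted average of the critics' Q-value estimates for one (state, action)."""
    return sum(w * q for w, q in zip(weights, q_values))


# Hypothetical validation losses for three critics over two rounds.
cumulative = [0.0, 0.0, 0.0]
for round_losses in [[0.2, 0.9, 0.5], [0.1, 0.8, 0.6]]:
    cumulative = [c + l for c, l in zip(cumulative, round_losses)]

weights = ftrl_weights(cumulative, eta=2.0)
# Hypothetical Q-value estimates from the three critics.
q_estimate = aggregate_q([1.0, 3.0, 2.0], weights)
```

Under these assumptions, critics with a lower cumulative loss receive a larger weight, so the aggregated estimate leans towards the critics that have predicted best so far. A larger `eta` concentrates the weight on the leading critic more quickly.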

Abstract

Deterministic policies are often preferred over stochastic ones when implemented on physical systems. They can prevent erratic and harmful behaviors while being easier to implement and interpret. However, in practice, exploration is largely performed by stochastic policies. First-order Bayesian Optimization (BO) methods offer a principled way of performing exploration using deterministic policies. This is done through a learned probabilistic model of the objective function and its gradient. Nonetheless, such approaches treat policy search as a black-box problem and thus neglect the reinforcement learning nature of the problem. In this work, we leverage the performance difference lemma to introduce a novel mean function for the probabilistic model. This results in augmenting BO methods with the action-value function. Hence, we call our method Augmented Bayesian Search (ABS). Interestingly, this new mean function enhances the posterior gradient with the deterministic policy gradient, effectively bridging the gap between BO and policy gradient methods. The resulting algorithm combines the convenience of direct policy search with the scalability of reinforcement learning. We validate ABS on high-dimensional locomotion problems and demonstrate competitive performance compared to existing direct policy search schemes.
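As a worked illustration of how the performance difference lemma can produce such a mean function, the first line below is the standard lemma for deterministic policies, with $d^{\pi}$ the normalised discounted state distribution. The second and third lines are a hedged sketch rather than the paper's exact definition: they assume the states are drawn from $d^{\pi_\theta}$ instead of $d^{\pi_x}$ and that the true advantage is replaced by a learned estimate $\widehat{A}_\phi$ with critic $\widehat{Q}_\phi$. Both symbols are introduced here for illustration.

```latex
\begin{align}
J(\pi_x) - J(\pi_\theta)
  &= \frac{1}{1-\gamma}\,
     \mathbb{E}_{s \sim d^{\pi_x}}\!\left[ A^{\pi_\theta}\big(s, \pi_x(s)\big) \right], \\
\widehat{m}_\phi(x)
  &\approx J(\pi_\theta) + \frac{1}{1-\gamma}\,
     \mathbb{E}_{s \sim d^{\pi_\theta}}\!\left[ \widehat{A}_\phi\big(s, \pi_x(s)\big) \right], \\
\nabla_x \widehat{m}_\phi(x)\Big|_{x=\theta}
  &= \frac{1}{1-\gamma}\,
     \mathbb{E}_{s \sim d^{\pi_\theta}}\!\left[
       \nabla_\theta \pi_\theta(s)\,
       \nabla_a \widehat{Q}_\phi(s, a)\Big|_{a=\pi_\theta(s)}
     \right].
\end{align}
```

Under these assumptions, the gradient of the mean function at $x = \theta$ has the form of the deterministic policy gradient, which is consistent with the abstract's claim that the mean function enhances the posterior gradient. Swapping $d^{\pi_x}$ for $d^{\pi_\theta}$ introduces a residual that depends on how far $\pi_x$ is from $\pi_\theta$.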

Paper Structure (23 sections, 6 theorems, 21 equations, 6 figures, 3 tables, 2 algorithms)

Key Result

Theorem 3.3

For an $(L_r, L_p)$-Lipschitz MDP operating with deterministic $L_\pi$-Lipschitz policies, and $\gamma L_p (1 + L_\pi) < 1$, we bound the residual term for any policies $\pi_x$ and $\pi_\theta$ as

Figures (6)

  • Figure 1: Behavior of the acquisition function of MPD gibo2 augmented with the advantage mean function (Equation \ref{eq:adv_mean}). The maximum of the acquisition function (blue dot) lies in the direction of the mean of the gradient posterior at $\theta$ (star), i.e. $\mu_\theta$ (violet line). The posterior corrects the mean gradient $\nabla_\theta \widehat{m}_\phi$ (pink line) when the mean function $\widehat{m}_\phi$ does not fit the observations.
  • Figure 2: Evolution of the validation and test scores on some of the MuJoCo tasks. We plot the results of a single seed to facilitate the interpretation of our results. We provide the histogram and correlations of these distributions in the Appendix.
  • Figure 3: Evolution of the maximum discounted policy return on the MuJoCo-v4 tasks. We use $5$ random seeds for every algorithm. We report the undiscounted returns in Figure 5 in the Appendix.
  • Figure 4: Ablation study: Effect of the adaptive aggregation on the performance of ABS. Combining adaptive aggregation and resetting the worst critic outperforms all baselines.
  • Figure 5: Evolution of the maximum undiscounted policy return on the MuJoCo-v4 tasks. We use $5$ random seeds for every algorithm.
  • ...and 1 more figure

Theorems & Definitions (10)

  • Theorem 3.3
  • Corollary 3.3.1
  • Corollary C.0.1
  • proof
  • Theorem C.4
  • proof
  • Lemma C.5
  • proof
  • Theorem C.6
  • proof