Smart Sampling: Self-Attention and Bootstrapping for Improved Ensembled Q-Learning

Muhammad Junaid Khan, Syed Hammad Ahmed, Gita Sukthankar

TL;DR

The paper addresses the challenge of sample-efficient learning in ensemble Q-learning by integrating bootstrapping of experiences with multi-head self-attention (MHA) inside each Q-learner, forming a variant of the REDQ/DroQ framework. The method preserves the ensemble structure (with $N$ learners and subset size $M$ for in-target minimization) while incorporating dropout, layer normalization, a fully connected pre-layer, and a multi-head attention module (8 heads, embedding $d \

Abstract

We present a novel method aimed at enhancing the sample efficiency of ensemble Q-learning. Our proposed approach integrates multi-head self-attention into the ensembled Q-networks while bootstrapping the state-action pairs ingested by the ensemble. This not only results in performance improvements over the original REDQ (Chen et al. 2021) and its variant DroQ (Hiraoka et al. 2022), thereby enhancing Q predictions, but also effectively reduces both the average normalized bias and the standard deviation of normalized bias within Q-function ensembles. Importantly, our method performs well even in scenarios with a low update-to-data (UTD) ratio. Notably, the implementation of our proposed method is straightforward, requiring minimal modifications to the base model.
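
To make the two ingredients named above concrete, here is a minimal sketch of bootstrapped sampling for an ensemble together with a REDQ-style target that takes the minimum over a random subset of target Q-values. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy replay buffer, the discount factor, and the `entropy_term` argument (standing in for the soft-actor-critic term $-\alpha \log \pi(a' \mid s')$) are all hypothetical, and the Q-values are plain floats rather than network outputs.

```python
import random


def bootstrap_batches(replay_buffer, num_learners, batch_size, rng):
    """Draw one minibatch per learner by sampling WITH replacement.

    Assumption: "bootstrapping" is read here as an independent resample
    of the replay buffer for every ensemble member.
    """
    return [
        [rng.choice(replay_buffer) for _ in range(batch_size)]
        for _ in range(num_learners)
    ]


def redq_style_target(reward, done, next_q_values, subset_size, rng,
                      gamma=0.99, entropy_term=0.0):
    """Target using the minimum over a random subset of M target Q-values.

    next_q_values holds one target-network estimate per ensemble member
    for the same (next_state, next_action) pair.
    """
    subset = rng.sample(range(len(next_q_values)), subset_size)
    min_q = min(next_q_values[i] for i in subset)
    return reward + gamma * (1.0 - done) * (min_q + entropy_term)


rng = random.Random(0)

# Toy buffer of (state, action, reward) tuples; purely illustrative.
buffer = [(f"s{i}", f"a{i}", float(i)) for i in range(100)]
batches = bootstrap_batches(buffer, num_learners=10, batch_size=32, rng=rng)

# Five hypothetical target Q-values, minimized over a random pair (M = 2).
next_qs = [5.0, 4.0, 6.0, 3.5, 7.0]
target = redq_style_target(reward=1.0, done=0.0, next_q_values=next_qs,
                           subset_size=2, rng=rng)
```

Minimizing over a random subset rather than the full ensemble is the REDQ mechanism for controlling overestimation; giving each learner its own resampled batch is one plausible way to diversify what the ensemble members see.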

Paper Structure (5 sections, 4 equations, 8 figures, 1 algorithm)

Figures (8)

  • Figure 1: Our modified REDQ approach incorporates both the bootstrapping and MHA mechanisms. Both the states and the actions are first concatenated, and then multiple bootstrapped samples are drawn from the replay buffer for the ensemble of Q-learners. Individual Q-learners incorporate a fully connected layer and multi-head attention on top of a Q-network. The Q-network integrates elements from both the REDQ and DroQ implementations.
  • Figure 2: An evaluation of Q-value prediction, average normalized Q-bias, and standard deviation of normalized Q-bias: REDQ vs. our proposed approach. Results are based on three separate runs for each environment. Our method demonstrates enhanced Q-value predictions while effectively managing estimation bias at a level comparable to REDQ.
  • Figure 3: Comparison of our method with identity connections and bootstrapping vs. DroQ. Our approach improves over DroQ in all the environments. Results were obtained from three distinct runs.
  • Figure 4: Comparison of Q-value prediction, average normalized bias, and standard deviation of normalized bias between REDQ ($G = 20, N = 10$) and our approach with a lower UTD ratio and a smaller ensemble ($G = 10, N = 5$). Our method still achieves better performance than REDQ. Results were obtained from three distinct runs for each environment.
  • Figure 5: Comparison of the distributions of max Q-value predictions. Our method achieves a better prediction distribution while also reducing the number of outliers, except in the case of Walker.
  • ...and 3 more figures
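
Several of the captions above report the average and the standard deviation of normalized Q-bias. This section does not state the formula, so the sketch below assumes the definition commonly used in REDQ-style analyses: the bias of each estimate is the predicted Q-value minus a Monte Carlo return for the same state-action pair, divided by the absolute mean of those returns. The function name and the sample numbers are hypothetical.

```python
from statistics import fmean, pstdev


def normalized_bias_stats(q_estimates, mc_returns):
    """Return (mean, std) of normalized Q-bias over a set of samples.

    Assumed definition: (Q_hat - G) / |mean(G)|, where G is the
    Monte Carlo return observed for the same state-action pair.
    """
    scale = abs(fmean(mc_returns))
    if scale == 0.0:
        raise ValueError("mean Monte Carlo return is zero; cannot normalize")
    normalized = [(q - g) / scale for q, g in zip(q_estimates, mc_returns)]
    return fmean(normalized), pstdev(normalized)


# Hypothetical Q predictions and matching Monte Carlo returns.
q_hat = [10.0, 12.0, 8.0, 11.0]
returns = [9.0, 11.0, 9.0, 11.0]
avg_bias, std_bias = normalized_bias_stats(q_hat, returns)
```

Under this definition, a positive average indicates overestimation and a negative one underestimation, while a small standard deviation means the bias is uniform across state-action pairs.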