Q-Star Meets Scalable Posterior Sampling: Bridging Theory and Practice via HyperAgent

Yingru Li; Jiawei Xu; Lei Han; Zhi-Quan Luo

TL;DR

We theoretically prove that, under tabular assumptions, HyperAgent achieves logarithmic per-step computational complexity while attaining sublinear regret, matching the best known randomized tabular RL algorithm.

Abstract

We propose HyperAgent, a reinforcement learning (RL) algorithm based on the hypermodel framework for exploration in RL. HyperAgent allows for the efficient incremental approximation of posteriors associated with an optimal action-value function ($Q^\star$) without the need for conjugacy and follows the greedy policies w.r.t. these approximate posterior samples. We demonstrate that HyperAgent offers robust performance in large-scale deep RL benchmarks. It can solve Deep Sea hard exploration problems with episodes that optimally scale with problem size and exhibits significant efficiency gains in the Atari suite. Implementing HyperAgent requires minimal code addition to well-established deep RL frameworks like DQN. We theoretically prove that, under tabular assumptions, HyperAgent achieves logarithmic per-step computational complexity while attaining sublinear regret, matching the best known randomized tabular RL algorithm.
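To make the idea of acting greedily with respect to approximate posterior samples of $Q^\star$ concrete, here is a minimal NumPy sketch of a last-layer linear hypermodel. It is an illustration under stated assumptions, not the authors' implementation: the class name `LastLayerLinearHypermodel`, the parameters `A` and `b`, the fixed feature vector, and the choice of sampling the index uniformly from the unit sphere are all hypothetical choices made for this example.

```python
import numpy as np


class LastLayerLinearHypermodel:
    """Hypothetical sketch: maps a random index to last-layer Q-value weights."""

    def __init__(self, feature_dim, num_actions, index_dim, rng):
        # A maps an index vector to a perturbation of the last-layer weights
        # (assumed parameterization, initialized randomly for illustration).
        self.A = 0.1 * rng.standard_normal((num_actions, feature_dim, index_dim))
        # b plays the role of the mean last-layer weights.
        self.b = np.zeros((num_actions, feature_dim))
        self.index_dim = index_dim
        self.rng = rng

    def sample_index(self):
        # Normalizing a Gaussian vector gives a uniform draw on the unit sphere.
        g = self.rng.standard_normal(self.index_dim)
        return g / np.linalg.norm(g)

    def q_values(self, features, index):
        # One index corresponds to one approximate posterior sample of Q.
        weights = self.A @ index + self.b  # shape: (num_actions, feature_dim)
        return weights @ features          # shape: (num_actions,)

    def greedy_action(self, features, index):
        # Act greedily with respect to the sampled Q-function.
        return int(np.argmax(self.q_values(features, index)))


rng = np.random.default_rng(0)
model = LastLayerLinearHypermodel(feature_dim=4, num_actions=3, index_dim=8, rng=rng)
features = rng.standard_normal(4)  # stand-in for a learned state representation
index = model.sample_index()
q = model.q_values(features, index)
action = model.greedy_action(features, index)
```

Resampling `index` yields a different plausible Q-function from the same parameters, which is what drives exploration in this sketch.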

Paper Structure (62 sections, 8 theorems, 67 equations, 21 figures, 9 tables, 1 algorithm)

Introduction
Key Contributions
Related Works
Bridging the Gap.
Reinforcement Learning & Hypermodel
Hypermodel
Algorithm design
Theoretical insights and analysis
Empirical studies
Computational results for deep exploration
Results on Atari benchmark
Conclusion and future directions
Additional discussion on related works
Discussion on the algorithmic simplicity and deployment efficiency.
Other Principled Exploration Approaches.
...and 47 more sections

Key Result

Lemma 4.1

Let $\tilde{m}_k$ be recursively defined as in eq:noise-incremenal with $\mathbf{z} \sim \mathcal{U}(\mathbb{S}^{M-1})$. For any $k \ge 1$, let $\mathcal{G}_{k, sa}(\varepsilon)$ denote the good event of $\varepsilon$-approximation. The joint event $\cap_{(s,a) \in \mathcal{S} \times \mathcal{A}} \cap_{k=1}^K \mathcal{G}_{k, sa}(\varepsilon)$ holds with probability at least $1 - \delta$ if $M \simeq \varepsilon^{-2} \log(|\mathcal{S}| |\ma
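The role of the index dimension $M$ can be illustrated numerically. The sketch below is a simplified, non-adaptive stand-in for the lemma's setting, not the paper's construction: it draws independent vectors uniformly from $\mathbb{S}^{M-1}$ and checks that the squared norm of a weighted sum of them stays close to the squared norm of the weight vector. The helper names `sample_sphere` and `projected_sq_norm`, and the values of `K` and `M`, are assumptions chosen for this example.

```python
import numpy as np


def sample_sphere(rng, n, M):
    # Each row is a uniform draw from the unit sphere S^{M-1},
    # obtained by normalizing a standard Gaussian vector.
    g = rng.standard_normal((n, M))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def projected_sq_norm(x, Z):
    # Squared norm of sum_i x_i * z_i, the random projection of x into R^M.
    return float(np.sum((x @ Z) ** 2))


rng = np.random.default_rng(0)
K, M = 200, 2048            # illustrative sizes, not taken from the paper
x = rng.standard_normal(K)  # arbitrary fixed weights
Z = sample_sphere(rng, K, M)

# In expectation the projected squared norm equals ||x||^2;
# the relative error shrinks as M grows.
ratio = projected_sq_norm(x, Z) / float(x @ x)
```

Rerunning with a smaller `M` makes `ratio` fluctuate more around 1, in line with the $\varepsilon^{-2}$ dependence of $M$ in the lemma.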

Figures (21)

Figure 1: This evaluation explores the relationship between the amount of training data and the number of model parameters required to achieve human-level performance, quantified by a 1.0 IQM score. Performance is assessed across 26 Atari games using the Interquartile Mean (IQM) metric (agarwal2021deep) for recent state-of-the-art (SOTA) algorithms. The number of parameters is directly proportional to the computational cost, as the parameters predominantly determine the computation in each SGD update per interaction step. HyperAgent, denoted by $\star$, achieves a 1.0 IQM score with a comparatively minimal number of interactions and parameters.
Figure 2: Last-layer linear hypermodel.
Figure 3: The metric $\operatorname{Episodes~to~Learn}(N):= \operatorname{avg}\{K|\bar{R}_K \geq 0.99\}$ measures the episodes needed to learn the optimal policy in DeepSea of size $N$, where $\bar{R}_K$ is the return achieved by the agent after $K$ episodes of interaction, averaged over 100 evaluations. The crossmark $\text{\color{red}✗}$ denotes the algorithm's failure to solve the problem within $10^4$ episodes. We conduct experiments on each algorithm with 10 different initial random seeds, presenting each result as a distinct point in the figure. The dashed line for HyperAgent, based on linear regression with an $R^2$ of 0.90, illustrates linear scaling in episode complexity, represented by $\Theta(N)$.
Figure 4: Comparison of HyperAgent on the 8 hardest exploration Atari games with Variational approximation (aravindan2021state), LangevinMC (ishfaq2024provable), Ensemble+ (osband2018randomized, osband2019deep) and Rainbow (hessel2018rainbow).
Figure 5: Illustration for DeepSea.
...and 16 more figures
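The IQM score referenced in Figure 1 can be sketched as follows. This is a minimal illustration that assumes the inputs are human-normalized scores pooled across runs and games; the function name `interquartile_mean` and the sample scores are hypothetical and not taken from the paper.

```python
import numpy as np


def interquartile_mean(scores):
    # Sort the pooled scores, discard the lowest and highest 25%,
    # and average the remaining middle 50%.
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    n = x.size
    cut = int(np.floor(0.25 * n))
    return float(x[cut:n - cut].mean())


# Made-up normalized scores; the outlier 10.0 does not affect the result.
scores = [0.1, 0.5, 0.9, 1.0, 1.1, 1.2, 2.0, 10.0]
iqm = interquartile_mean(scores)
```

Because the top and bottom quartiles are trimmed, a single game with an extreme score cannot dominate the aggregate, which is the usual motivation for preferring IQM over the plain mean.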

Theorems & Definitions (18)

Lemma 4.1: Incremental posterior approximation
Theorem 4.2
Remark 4.3
Lemma 3.1: Contraction mapping
Proof: Proof of lem:contraction
Lemma 4.1: Distributional JL lemma johnson1984extensions
Theorem 4.2: Sequential random projection in adaptive process li2024probability
Remark 4.3
Proof: Proof of lem:approx
Remark 4.4: Reduction
...and 8 more
