CAE: Repurposing the Critic as an Explorer in Deep Reinforcement Learning

Yexin Li, Pring Wong, Hanfang Zhang, Shuo Chen, Siyuan Qi

TL;DR

This work addresses exploration in deep reinforcement learning by proposing CAE, a lightweight method that repurposes value-network embeddings to generate exploration bonuses without adding parameters. It applies linear multi-armed bandit techniques to these embeddings, with a scaling strategy to ensure stability and provable sub-linear regret. An extension, CAE$+$, adds a small auxiliary network to boost performance in sparse-reward tasks while keeping overhead minimal. Theoretical regret bounds are established, and extensive experiments on MuJoCo and MiniHack demonstrate that CAE and CAE$+$ consistently outperform state-of-the-art baselines, bridging theory and practice in DRL exploration.

Abstract

Exploration remains a critical challenge in reinforcement learning, as many existing methods either lack theoretical guarantees or fall short of practical effectiveness. In this paper, we introduce CAE, a lightweight algorithm that repurposes the value networks in standard deep RL algorithms to drive exploration without introducing additional parameters. CAE utilizes any linear multi-armed bandit technique and incorporates an appropriate scaling strategy, enabling efficient exploration with provable sub-linear regret bounds and practical stability. Notably, it is simple to implement, requiring only around 10 lines of code. In complex tasks where learning an effective value network proves challenging, we propose CAE+, an extension of CAE that incorporates an auxiliary network. This extension increases the parameter count by less than 1% while maintaining implementation simplicity, adding only about 10 additional lines of code. Experiments on MuJoCo and MiniHack show that both CAE and CAE+ outperform state-of-the-art baselines, bridging the gap between theoretical rigor and practical efficiency.
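
The idea of turning critic embeddings into exploration bonuses with a linear bandit can be illustrated with a minimal sketch. It assumes a LinUCB-style uncertainty width, $\beta \sqrt{\phi^{\top} A^{-1} \phi}$, as the bandit technique; the abstract only says that any linear multi-armed bandit technique can be used, so this is one possible choice and not necessarily the authors' implementation. The names `EmbeddingUCBBonus`, `critic_embedding`, `ridge`, and `beta` are hypothetical, the random `weights` matrix merely stands in for a trained critic's embedding layers, and the paper's scaling strategy is not reproduced here.

```python
import numpy as np


class EmbeddingUCBBonus:
    """LinUCB-style bonus over fixed-size embeddings (illustrative sketch only)."""

    def __init__(self, dim, ridge=1.0, beta=1.0):
        # beta is an assumed bonus coefficient, not a value from the paper.
        self.beta = beta
        # Inverse of the regularized covariance A = ridge * I + sum(phi phi^T).
        self.a_inv = np.eye(dim) / ridge

    def bonus(self, phi):
        # Uncertainty width: beta * sqrt(phi^T A^{-1} phi).
        return self.beta * float(np.sqrt(phi @ self.a_inv @ phi))

    def update(self, phi):
        # Sherman-Morrison rank-one update of A^{-1} after observing phi.
        v = self.a_inv @ phi
        self.a_inv -= np.outer(v, v) / (1.0 + phi @ v)


def critic_embedding(state_action, weights):
    # Hypothetical stand-in for the critic's embedding layers.
    return np.tanh(weights @ state_action)


rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 4))      # placeholder for trained critic weights
explorer = EmbeddingUCBBonus(dim=8)

phi = critic_embedding(rng.normal(size=4), weights)
first = explorer.bonus(phi)            # bonus for a never-seen embedding
explorer.update(phi)
second = explorer.bonus(phi)           # bonus after the same embedding is recorded
```

In this sketch the bonus for an embedding shrinks once that embedding has been recorded, so rarely visited regions of embedding space keep a larger bonus. In a training loop, such a bonus would typically be added to the environment reward before the value update, though that wiring is an assumption here.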

Paper Structure

This paper contains 27 sections, 6 theorems, 39 equations, 7 figures, 8 tables, and 3 algorithms.

Key Result

Theorem 4.2

Suppose the standard assumptions from the literature [provable_kernel_nnshallow_exploration] hold, $\left\| \boldsymbol{\theta}^{*} \right\|_{2} \leq 1$ and $\left\| (s_{h}; a_{h}) \right\|_{2} \leq 1$. For any $\sigma \in (0, 1)$, assume the number of parameters in each of the $L$ layers of $\phi$ [...] then with probability at least $1-\sigma$, it holds that: [...] where $C_{1}, C_{2}, C_{3}, C_{4}$ are c [...]

Figures (7)

  • Figure 1: Comparison of existing exploration methods, such as E3B, with CAE and CAE$+$. Here, $L_{b}$ represents the Bellman loss used to update the state- or action-value function, while $L_{f}$ refers to the loss of the auxiliary network. E3B requires additional networks to generate exploration bonuses, while CAE utilizes the embedding layers of the value network for bonuses, resulting in reduced computational overhead and no additional parameters. CAE$+$ extends CAE by incorporating a small auxiliary network to enhance performance in sparse-reward environments, with only a minor increase in parameters.
  • Figure 2: CAE$+$ with state-value function. $L_{b}$ represents the Bellman loss used to update the action-value function, while $L_{f}$ refers to the loss of the auxiliary network, which is detailed in Eq. \ref{eq:idn_loss}.
  • Figure 3: Ablation study of the scaling strategy on MuJoCo. The horizontal axis denotes the number of steps, in multiples of $10^{6}$.
  • Figure 4: Ablation study of $\boldsymbol{U}$ on MiniHack tasks. The horizontal axis denotes the number of interaction steps, in multiples of $10^{7}$.
  • Figure 5: Experimental results on MuJoCo-v3.
  • ...and 2 more figures

Theorems & Definitions (10)

  • Definition 4.1
  • Theorem 4.2
  • proof
  • Lemma 4.1
  • Lemma 4.2
  • Lemma 4.3
  • proof
  • Lemma 4.7
  • Lemma 4.8
  • proof