The Curse of Diversity in Ensemble-Based Exploration

Zhixuan Lin; Pierluca D'Oro; Evgenii Nikishin; Aaron Courville

TL;DR

This work reveals a surprising pitfall of ensemble-based exploration: training a diverse, data-sharing ensemble can significantly impair individual members due to off-policy learning from peers’ data. It analyzes the root cause via $p\%$-tandem experiments, showing that learning from predominantly non-self-generated data degrades performance, and demonstrates that naive remedies (e.g., larger replay buffers or smaller ensembles) are not consistently effective. To address this, it introduces Cross-Ensemble Representation Learning (CERL), which adds auxiliary heads to encourage cross-member value learning while preserving ensemble diversity; CERL substantially mitigates the curse in both discrete Atari and continuous MuJoCo domains and improves aggregated policy performance. The findings offer practical caveats for ensemble-based exploration and highlight representation learning as a promising direction for maintaining diversity without sacrificing performance. The work also provides reproducible methodology and open-source implementations for further research.

Abstract

We uncover a surprising phenomenon in deep reinforcement learning: training a diverse ensemble of data-sharing agents -- a well-established exploration strategy -- can significantly impair the performance of the individual ensemble members when compared to standard single-agent training. Through careful analysis, we attribute the degradation in performance to the low proportion of self-generated data in the shared training data for each ensemble member, as well as the inefficiency with which the individual ensemble members learn from such highly off-policy data. We thus name this phenomenon the curse of diversity. We find that several intuitive solutions -- such as a larger replay buffer or a smaller ensemble size -- either fail to consistently mitigate the performance loss or undermine the advantages of ensembling. Finally, we demonstrate the potential of representation learning to counteract the curse of diversity with a novel method named Cross-Ensemble Representation Learning (CERL) in both discrete and continuous control domains. Our work offers valuable insights into an unexpected pitfall in ensemble-based exploration and raises important caveats for future applications of similar approaches.
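The "low proportion of self-generated data" can be made concrete with a small simulation. The sketch below is illustrative only and is not the authors' code: the function name `self_generated_fractions` and the uniform per-transition choice of acting member are assumptions made for brevity. It tags each transition in a shared buffer with the member that generated it and reports each member's share, which comes out close to $1/N$.

```python
import random
from collections import Counter


def self_generated_fractions(num_members, num_transitions, seed=0):
    """Return, per member, the share of buffer transitions it generated."""
    rng = random.Random(seed)
    # Assumed simplification: the acting member is drawn uniformly at random
    # for every transition (ensemble methods often pick one per episode).
    owners = [rng.randrange(num_members) for _ in range(num_transitions)]
    counts = Counter(owners)
    return [counts[m] / num_transitions for m in range(num_members)]


fractions = self_generated_fractions(num_members=10, num_transitions=100_000)
for member, frac in enumerate(fractions):
    print(f"member {member}: {frac:.3f} of the shared buffer is self-generated")
```

With $N=10$, roughly 90% of what each member trains on was generated by its peers, which is the highly off-policy regime the abstract describes.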

Paper Structure (39 sections, 3 equations, 28 figures, 1 table, 4 algorithms)

Introduction
Preliminaries
The curse of diversity
The negative effect of ensemble-based exploration
Understanding ensemble performance degradation
Mitigating the curse of diversity: initial attempts
Mitigating the curse of diversity with representation learning
Related work
Discussion and conclusion
Appendix
Algorithms
Experimental details
Atari
MuJoCo
$p\%$-tandem experiments
...and 24 more sections

Figures (28)

Figure 1: Comparison between standard single-agent exploration and ensemble-based exploration. In single-agent training, one agent generates and learns from all the data. In ensemble-based exploration with $N$ ensemble members, each agent generates $1/N$ of the data but learns from all the data.
Figure 2: (top-left) Comparison between Double DQN, Bootstrapped DQN (agg.), and Bootstrapped DQN (indiv.) in $55$ Atari games. Shaded areas show $95\%$ bootstrapped CIs over $5$ seeds. (top-right) Per-game performance improvement of Bootstrapped DQN (indiv.) and Bootstrapped DQN (agg.) over Double DQN, measured as the difference in HNS. All methods use a replay buffer size of $1$M. (bottom) Comparison between SAC, Ensemble SAC (indiv.) and Ensemble SAC (agg.) in $4$ MuJoCo tasks with a replay buffer size of $200$k. Shaded areas show $95\%$ bootstrapped CIs over $30$ seeds. All ensemble methods in this figure use $N=10$ and $L=0$.
Figure 3: (left) Different algorithms as variants of the same ensemble algorithm, using $N=4$ as an example. Each block represents $25\%$ of the generated data. Data blocks of the same color are generated by the same agent. (middle) Comparison between Double DQN, Bootstrapped DQN (indiv.) with $N=10$ and $L=0$, and the active and passive agents in the $10\%$-tandem setup. All methods use a replay buffer of size $1$M. Shaded areas show $95\%$ bootstrapped CIs. Results are aggregated over $5$ seeds and $55$ games. (right) Correlation between (1) the performance gap between the active and passive agents and (2) the performance gap between Double DQN and Bootstrapped DQN (indiv.) in different games. Each point corresponds to a game. We use Double DQN normalized scores instead of HNS since the scale of the latter can vary a lot across games. Eight games where Double DQN's performance is close to random ($\mathrm{HNS} < 0.05$) and one game whose data point lies on the negative half of the $y$-axis in the plot are omitted, since they trivially satisfy our hypothesis.
Figure 4: (left) The effects of replay buffer size in $4$ Atari games. Error bars show $95\%$ bootstrapped CIs over $5$ seeds. (right) The effects of replay buffer size in $4$ MuJoCo tasks. Error bars show $95\%$ bootstrapped CIs over $30$ seeds. We use $N=10$ and $L=0$ for Bootstrapped DQN.
Figure 5: (left) The effects of adjusting the ensemble size. We use $L=0$ for Bootstrapped DQN. (right) The effects of varying the number of shared layers. We use $N=10$ for Bootstrapped DQN. The top rows show Double DQN normalized scores. The bottom rows show the entropy of the normalized vote distributions. Error bars show $95\%$ bootstrapped CIs over $5$ seeds. All methods use a replay buffer of size $1$M.
...and 23 more figures
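
The Figure 5 caption uses the entropy of the normalized vote distribution as its diversity measure. The following is a minimal sketch of one plausible way to compute it, not the paper's implementation. It assumes that each ensemble member casts one vote for its greedy action, that the aggregated policy takes the majority vote, and that entropy is measured in nats; the function names are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(greedy_actions):
    """Entropy (in nats) of the normalized vote distribution over actions."""
    n = len(greedy_actions)
    counts = Counter(greedy_actions)
    # Normalize vote counts into a probability distribution over actions.
    probs = [c / n for c in counts.values()]
    return sum(-p * math.log(p) for p in probs)


def majority_vote(greedy_actions):
    """Assumed aggregation rule: pick the action with the most votes."""
    return Counter(greedy_actions).most_common(1)[0][0]


# All four members agree: no diversity in the votes.
unanimous = vote_entropy([2, 2, 2, 2])
# Four members, four different actions: maximal diversity for N=4.
split = vote_entropy([0, 1, 2, 3])
print(unanimous, split, majority_vote([1, 1, 0, 3]))
```

Under these assumptions, a unanimous ensemble has zero entropy, while an ensemble whose $N$ members each pick a different action reaches the maximum of $\log N$.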
