Sample Efficient Myopic Exploration Through Multitask Reinforcement Learning with Diverse Tasks
Ziping Xu, Zifan Xu, Runxuan Jiang, Peter Stone, Ambuj Tewari
TL;DR
This work investigates exploration in multitask reinforcement learning (MTRL) and shows that when the task set is sufficiently diverse, a simple policy-sharing scheme with myopic exploration (notably $\epsilon$-greedy) can achieve polynomial sample complexity across tasks. The authors introduce a multitask MEG framework and a generic policy-sharing algorithm that uses a mixture of exploration policies across all tasks, along with a formal diversity condition and complexity bounds that scale with Bellman-Eluder-type dimensions. They compare multitask and single-task MEG, proving that diversity can yield substantial gains, including exponential separations from worst-case single-task settings, and that the results hold under common function-approximation settings (linear, tabular). The approach is connected to HER and curriculum learning and validated with experiments on synthetic robotic-control environments, where a diverse task set, mirroring the tasks selected by automatic curricula, improves sample efficiency and aligns with observed task-prioritization patterns. Overall, the paper makes a theoretical and empirical case that diversity in the task set, coupled with simple myopic exploration, can meaningfully reduce exploration complexity in MTRL, and it offers insight into the practical success of curriculum-like strategies.
Abstract
Multitask Reinforcement Learning (MTRL) approaches have gained increasing attention for their wide applications in many important Reinforcement Learning (RL) tasks. However, while recent advancements in MTRL theory have focused on improved statistical efficiency under the assumption of a shared structure across tasks, exploration, a crucial aspect of RL, has been largely overlooked. This paper addresses this gap by showing that when an agent is trained on a sufficiently diverse set of tasks, a generic policy-sharing algorithm with a myopic exploration design such as $\epsilon$-greedy, which is inefficient in general, can be sample-efficient for MTRL. To the best of our knowledge, this is the first theoretical demonstration of the "exploration benefits" of MTRL. It may also shed light on the enigmatic success of myopic exploration, which is widely used in practice. To validate the role of diversity, we conduct experiments on synthetic robotic control environments, where the diverse task set aligns with the tasks selected by automatic curriculum learning, which has been shown empirically to improve sample efficiency.
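To make the idea of policy sharing with myopic exploration concrete, the following is a minimal sketch rather than the paper's algorithm. It assumes tabular Q-values, a uniform mixture over tasks, and hypothetical names (`epsilon_greedy_action`, `sample_source_task`, `mixture_action`) that do not come from the paper. For each episode, one task's greedy policy is drawn from the mixture and then perturbed with $\epsilon$-greedy noise.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def sample_source_task(num_tasks, rng):
    """Draw which task's policy to follow (uniform mixture: an assumption of this sketch)."""
    return rng.randrange(num_tasks)


def mixture_action(q_tables, state, epsilon, source_task, rng):
    """Act epsilon-greedily with respect to the Q-table of the sampled source task.

    q_tables[i][state] is the list of action values that task i currently
    estimates for `state` (hypothetical tabular representation).
    """
    return epsilon_greedy_action(q_tables[source_task][state], epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two toy tasks that prefer different actions in the same state.
    q_tables = [{"s0": [1.0, 0.0]}, {"s0": [0.0, 1.0]}]
    source = sample_source_task(len(q_tables), rng)
    action = mixture_action(q_tables, "s0", 0.1, source, rng)
    print(f"following task {source}, chose action {action}")
```

In this sketch, data collected for one task can be driven by another task's greedy policy, which is the mechanism through which a diverse task set could supply exploration that $\epsilon$-greedy alone would not.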
