Sample Efficient Myopic Exploration Through Multitask Reinforcement Learning with Diverse Tasks

Ziping Xu; Zifan Xu; Runxuan Jiang; Peter Stone; Ambuj Tewari

TL;DR

This work investigates exploration in multitask reinforcement learning (MTRL) and shows that when the task set is sufficiently diverse, a simple policy-sharing scheme with myopic exploration (notably $\epsilon$-greedy) can achieve polynomial sample complexity across tasks. The authors introduce a multitask myopic exploration gap (MEG) framework and a generic policy-sharing algorithm that explores with a mixture of the exploration policies of all tasks, along with a formal diversity condition and complexity bounds that scale with Bellman-Eluder-type dimensions. They compare multitask and single-task MEG, proving that diversity can yield substantial gains and even exponential separations in worst-case single-task settings, and they discuss how the diversity condition applies under common function-approximation settings (linear, tabular). The approach is connected to HER and curriculum learning and validated with experiments on synthetic robotic-control environments, where the diverse task set mirrors the tasks prioritized by automatic curricula, which are known empirically to improve sample efficiency. Overall, the paper provides a theoretical and empirical case that diversity in the task set, coupled with simple myopic exploration, can meaningfully reduce exploration complexity in MTRL and offer insights into the practical success of curriculum-like strategies.

Abstract

Multitask Reinforcement Learning (MTRL) approaches have gained increasing attention for their wide applications in many important Reinforcement Learning (RL) tasks. However, while recent advancements in MTRL theory have focused on the improved statistical efficiency obtained by assuming a shared structure across tasks, exploration, a crucial aspect of RL, has been largely overlooked. This paper addresses this gap by showing that when an agent is trained on a sufficiently diverse set of tasks, a generic policy-sharing algorithm with a myopic exploration design such as $\epsilon$-greedy, which is inefficient in general, can be sample-efficient for MTRL. To the best of our knowledge, this is the first theoretical demonstration of the "exploration benefits" of MTRL. It may also shed light on the enigmatic success of the wide applications of myopic exploration in practice. To validate the role of diversity, we conduct experiments on synthetic robotic control environments, where the diverse task set aligns with the task selection by automatic curriculum learning, which is empirically shown to improve sample efficiency.
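The policy-sharing idea described in the abstract can be illustrated with a small tabular sketch. This is not the authors' algorithm: it assumes a hallway environment in the spirit of Figure 1, one task per goal state, and plain Q-learning updates in place of the value-function fitting step. The names `make_mixture_policy`, `hallway_step`, and `train`, along with all hyperparameters, are hypothetical choices made for illustration.

```python
import random


def make_mixture_policy(q_tables, epsilon, n_actions, rng):
    """Return a factory that samples one epsilon-greedy policy from the task mixture."""
    def start_episode():
        q = rng.choice(q_tables)  # mixture: draw one task's Q-table uniformly per episode

        def act(state):
            if rng.random() < epsilon:  # myopic exploration: uniform random action
                return rng.randrange(n_actions)
            row = q[state]
            return max(range(n_actions), key=lambda a: row[a])  # greedy action

        return act

    return start_episode


def hallway_step(state, action, goal, n_states):
    """Assumed dynamics: action 1 moves forward, action 0 moves backward; reward 1 at the goal."""
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == goal else 0.0)


def train(n_states=8, horizon=8, rounds=300, epsilon=0.2, lr=0.5, seed=0):
    rng = random.Random(seed)
    goals = list(range(1, n_states))  # assumed diverse task set: one goal per state
    q_tables = [[[0.0, 0.0] for _ in range(n_states)] for _ in goals]
    for _ in range(rounds):
        start_episode = make_mixture_policy(q_tables, epsilon, 2, rng)
        for task, goal in enumerate(goals):
            act = start_episode()  # behavior policy may come from any task
            state = 0
            for _ in range(horizon):
                action = act(state)
                nxt, reward = hallway_step(state, action, goal, n_states)
                done = nxt == goal
                # Q-learning update for the current task (a stand-in for value fitting).
                target = reward + (0.0 if done else max(q_tables[task][nxt]))
                q_tables[task][state][action] += lr * (target - q_tables[task][state][action])
                state = nxt
                if done:
                    break
    return goals, q_tables
```

Calling `train()` returns the goal list and one Q-table per task. The design choice worth noting is that each task's data is collected by a policy drawn from the mixture over all tasks, so a task with a distant goal can be carried toward it by the greedy policies of tasks with nearer goals, which is the intuition behind the diversity condition.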

Paper Structure (41 sections, 23 theorems, 73 equations, 4 figures, 1 table, 2 algorithms)

Introduction
Problem Setup
Proposed Multitask Learning Scenario
Value Function Approximation
Myopic Exploration Design
Multitask RL Algorithm with Policy-Sharing
Connection to Hindsight Experience Replay (HER) and Multi-goal RL
Connection to Curriculum Learning
Generic Sample Complexity Guarantee
Multitask Myopic Exploration Gap
Sample Complexity Guarantee
Comparing Single-task and Multitask MEG
Lower Bounding Myopic Exploration Gap
Discussions on the Tabular Case
Implications of Diversity on Robotic Control Environments
...and 26 more sections

Key Result

Theorem 1

Consider a multitask RL problem with MDP set $\mathcal{M}$ and value function class $\mathcal{F}$ such that $\mathcal{M}$ is $(\tilde{\alpha}, \tilde{c})$-diverse. Then the generic policy-sharing algorithm with the $\epsilon$-greedy exploration function has a sample-complexity

Figures (4)

Figure 1: A diverse grid-world task set on a long hallway with $N+1$ states. From left to right, the panels show a single-task and a multitask learning scenario, respectively. The triangles represent the starting state and the stars represent the goal states, where an agent receives a positive reward. The agent can choose to move forward or backward.
Figure 2: (a) BipedalWalker environment with different stump spacings and heights. (b-c) Boxplots of the log-scaled eigenvalues of sample covariance matrices of the trained embeddings generated by the near-optimal policies for different environments. In (b), we average over environments with the same height, while in (c) we average over environments with the same spacing. (d) Task preference of the automatically generated curriculum at 5M and 10M training steps, respectively. The red regions are the regions where a task has a higher probability of being sampled.
Figure 3: (b-c) Log-scaled eigenvalues of sample covariance matrices of the trained embeddings generated by the near optimal policies for different environments.
Figure 4: An illustration of why a full-rank set of reward parameters does not achieve diversity. The red arrows are two reward parameters and the star marks the generated state distributions of the optimal policies corresponding to the two rewards at the step $h$. Since both optimal policies only visit state 1, they may not provide a sufficient exploration for the next time step $h+1$.
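The point of the Figure 4 caption, that optimal policies which all visit the same state leave some feature directions unexplored, can be checked numerically on a toy case. The sketch below is an illustration under stated assumptions, not the paper's construction: it uses one-hot features for two states and the uncentered second-moment matrix of visited features as a stand-in for the covariance spectra shown in Figures 2 and 3. The helper names are hypothetical.

```python
import math


def second_moment(features):
    """Average outer product of 2-D feature vectors (uncentered second-moment matrix)."""
    n = len(features)
    a = sum(x * x for x, _ in features) / n
    b = sum(x * y for x, y in features) / n
    d = sum(y * y for _, y in features) / n
    return a, b, d  # entries of the symmetric matrix [[a, b], [b, d]]


def eigenvalues_2x2(a, b, d):
    """Closed-form eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean + radius, mean - radius


# One-hot features for two states (an illustrative assumption).
STATE_1, STATE_2 = (1.0, 0.0), (0.0, 1.0)

# Both optimal policies visit only state 1, as described in the Figure 4 caption.
collapsed = [STATE_1, STATE_1]
# A hypothetical task set whose optimal policies visit different states.
spread = [STATE_1, STATE_2]

collapsed_eigs = eigenvalues_2x2(*second_moment(collapsed))
spread_eigs = eigenvalues_2x2(*second_moment(spread))
```

In the collapsed case the smallest eigenvalue is zero, so one direction of the feature space receives no coverage at step $h$, whereas the spread case covers both directions equally. A small minimum eigenvalue of this kind is the signal that the log-scaled spectra in Figures 2 and 3 are used to read off.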

Theorems & Definitions (48)

Definition 1: MTRL Sample Complexity
Definition 2: Mixture Policy
Definition 3: Multitask Myopic Exploration Gap (Multitask MEG)
Definition 4: Multitask Suboptimality
Definition 5: Diverse Tasks
Theorem 1: Upper Bound for Sample Complexity
Proposition 1
Proposition 2
Definition 6: Linear MDP (Jin et al., 2020)
Proposition 3: Proposition 2.3 of Jin et al. (2020)
...and 38 more
