QMP: Q-switch Mixture of Policies for Multi-Task Behavior Sharing

Grace Zhang, Ayush Jain, Injune Hwang, Shao-Hua Sun, Joseph J. Lim

TL;DR

QMP addresses sample efficiency in multi-task reinforcement learning by enabling selective behavior sharing across tasks through a Q-function-guided mixture of policies. It defines a mixture policy $\pi_i^{\text{mix}}(a|s)=\arg\max_{\pi' \in \{\pi_1,\dots,\pi_N\}} \left( \mathbb{E}_{a\sim\pi'(\cdot|s)} [ Q^{\pi_i}(s,a) ] + \alpha \mathcal{H}[\pi'(\cdot|s)] \right)$, which augments off-policy data collection without biasing the current task's objective. The authors prove convergence guarantees and show that QMP yields complementary performance gains over several MTRL baselines across manipulation, locomotion, and navigation tasks, including scaling benefits with more tasks. Empirically, QMP learns when to share behaviors by adjusting mixture probabilities, reduces suboptimality gaps, and demonstrates robustness and compatibility with existing off-policy algorithms like SAC; future work includes temporally extended sharing and richer cross-task priors.

Abstract

Multi-task reinforcement learning (MTRL) aims to learn several tasks simultaneously for better sample efficiency than learning them separately. Traditional methods achieve this by sharing parameters or relabeled data between tasks. In this work, we introduce a new framework for sharing behavioral policies across tasks, which can be used in addition to existing MTRL methods. The key idea is to improve each task's off-policy data collection by employing behaviors from other task policies. Selectively sharing helpful behaviors acquired in one task to collect training data for another task can lead to higher-quality trajectories, leading to more sample-efficient MTRL. Thus, we introduce a simple and principled framework called Q-switch mixture of policies (QMP) that selectively shares behavior between different task policies by using the task's Q-function to evaluate and select useful shareable behaviors. We theoretically analyze how QMP improves the sample efficiency of the underlying RL algorithm. Our experiments show that QMP's behavioral policy sharing provides complementary gains over many popular MTRL algorithms and outperforms alternative ways to share behaviors in various manipulation, locomotion, and navigation environments. Videos are available at https://qmp-mtrl.github.io.
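
The selection step described above (each task policy proposes an action, and the current task's Q-function picks among the proposals) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it scores a single sampled action per policy and omits the entropy term from the TL;DR formula. The names `q_switch`, `toy_q`, `toy_policies`, and `GOAL` are hypothetical, and the Q-function is hand-written rather than learned.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, float]
Action = Tuple[float, float]
Policy = Callable[[State], Action]            # returns one proposed action
QFunction = Callable[[State, Action], float]  # task-specific critic Q_i


def q_switch(state: State, policies: Sequence[Policy], q_i: QFunction) -> Tuple[int, Action]:
    """Return (index, action) of the proposal that Q_i scores highest."""
    proposals = [policy(state) for policy in policies]
    scores = [q_i(state, action) for action in proposals]
    best = max(range(len(proposals)), key=lambda j: scores[j])
    return best, proposals[best]


# Hypothetical stand-in for a learned critic: negative distance to a goal
# after applying the action as a displacement.
GOAL = (1.0, 0.0)


def toy_q(state: State, action: Action) -> float:
    next_x = state[0] + action[0]
    next_y = state[1] + action[1]
    return -(((next_x - GOAL[0]) ** 2 + (next_y - GOAL[1]) ** 2) ** 0.5)


# Three deterministic toy "task policies": stay, step right, step left.
toy_policies = [
    lambda s: (0.0, 0.0),
    lambda s: (1.0, 0.0),
    lambda s: (-1.0, 0.0),
]

chosen_index, chosen_action = q_switch((0.0, 0.0), toy_policies, toy_q)
```

In this toy setup the "step right" proposal lands on the goal, so the switch selects it. In QMP the selected action is used only for data collection; each task's policy and Q-function are still trained on that task's own objective.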

Paper Structure (48 sections, 3 theorems, 18 equations, 21 figures, 5 tables, 1 algorithm)

This paper contains 48 sections, 3 theorems, 18 equations, 21 figures, 5 tables, 1 algorithm.

Key Result

Theorem 5.1

Consider $\pi_i^\text{old}$ and its associated Q-function $Q_i$. Apply SAC's policy improvement $\pi_i^\text{old} \to \pi_i$ and then the mixture step $\pi_i \to \pi_i^\text{mix}$ from the paper's final mixture-policy equation. Then, $Q^{\pi_i^\text{mix}}(\mathbf{s}_t, \mathbf{a}_t) \geq Q^{\pi_i}(\mathbf{s}_t, \mathbf{a}_t) \geq Q^{\pi_

Figures (21)

  • Figure 1: We propose a sample-efficient MTRL framework that selectively shares behaviors by acting with other task policies for data collection. For example, Drawer Open and Drawer Close can share behaviors for grasping the drawer handle, while Drawer Open and Door Close share behaviors for approaching the tabletop.
  • Figure 2: Our method (QMP) shares behavior between task policies in the data collection phase using a mixture of these policies. For example, in Task 1, each task policy proposes an action $a_j$. The task-specific Q-switch evaluates each $Q_1(s, a_j)$ and selects the highest-scoring policy to gather reward-labeled data for training $Q_1$ and $\pi_1$. Thus, Task 1 is boosted by incorporating high-reward shareable behaviors into $\pi_1$ and by improving $Q_1$ for subsequent Q-switch evaluations.
  • Figure 3: QMP generalized policy iteration
  • Figure 4: 2D Point Reaching. We visualize the training trajectories of $\pi$ with different sets of task policies (fixed but stochastic) and color each step with the policy that proposed it. (a) The single-task SAC policy cannot reach the goal. (b) With 3 diverse policies ($\color{ForestGreen}{\uparrow} \; \color{red}{\rightarrow} \; \color{violet}{\swarrow}$), QMP often selects other policies, showing the suboptimality gap from $Q$ in the SAC policy improvement equation. (c) When a highly relevant $\color{violet}{\nearrow}$ policy is added, QMP often selects $\color{violet}{\nearrow}$ as it is likely to best optimize the learned Q-function.
  • Figure 5: QMP improves performance using other policies, increasingly so when they are task-relevant (5 seeds).
  • ...and 16 more figures
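
The 2D point-reaching setup in Figure 4 can be loosely imitated with a toy rollout that counts how often each fixed, stochastic policy wins the Q-switch. The sketch below is an illustration under strong assumptions, not a reproduction of the paper's experiment: there is no learned task policy in the mixture, the critic is a hand-specified negative distance to the goal, and `GOAL`, `NOISE_STD`, `DIRECTIONS`, `toy_q`, and `rollout` are all hypothetical names and values.

```python
import math
import random
from collections import Counter

GOAL = (10.0, 10.0)   # hypothetical goal position
NOISE_STD = 0.02      # hypothetical action noise of each fixed policy

# Mean displacement of each fixed policy (unit-length steps).
DIRECTIONS = {
    "up": (0.0, 1.0),
    "right": (1.0, 0.0),
    "southwest": (-math.sqrt(0.5), -math.sqrt(0.5)),
    "northeast": (math.sqrt(0.5), math.sqrt(0.5)),
}


def toy_q(state, action):
    """Stand-in critic: negative distance to the goal after the step."""
    next_x, next_y = state[0] + action[0], state[1] + action[1]
    return -math.hypot(GOAL[0] - next_x, GOAL[1] - next_y)


def rollout(policy_names, steps=10, seed=0):
    """Act with the Q-switch for `steps` steps; return (pick counts, final state)."""
    rng = random.Random(seed)
    state = (0.0, 0.0)
    picks = Counter()
    for _ in range(steps):
        # Each fixed policy proposes a noisy step in its own direction.
        proposals = {
            name: (DIRECTIONS[name][0] + rng.gauss(0.0, NOISE_STD),
                   DIRECTIONS[name][1] + rng.gauss(0.0, NOISE_STD))
            for name in policy_names
        }
        # The Q-switch keeps the proposal with the highest critic score.
        best = max(proposals, key=lambda name: toy_q(state, proposals[name]))
        picks[best] += 1
        state = (state[0] + proposals[best][0], state[1] + proposals[best][1])
    return picks, state


picks_without, end_without = rollout(["up", "right", "southwest"])
picks_with, end_with = rollout(["up", "right", "southwest", "northeast"])
```

Under these assumptions the switch ignores the policy that moves away from the goal, and once the goal-aligned `northeast` policy is available it dominates the picks and the rollout ends closer to the goal. This mirrors the qualitative trend in panels (b) and (c), though with a fixed critic rather than a learned one.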

Theorems & Definitions (7)

  • Definition 4.1: Mixture of Policies
  • Definition 4.2: Q-switch Mixture of Policies: QMP
  • Theorem 5.1: Mixture Soft Policy Improvement
  • Theorem C.1: Mixture Soft Policy Improvement
  • proof
  • Theorem C.2: Mixture Soft Policy Iteration
  • proof