Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing

Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng

TL;DR

This work addresses varying task difficulty in multi-task reinforcement learning with Dynamic Depth Routing (D2R), which learns task-specific routing that skips intermediate modules, so that each task uses as many or as few modules as it needs. The framework pairs a base module network with a routing network; together they form a differentiable DAG per task, which allows flexible depth and knowledge sharing. Because behavior and target policies can follow different routing paths during off-policy training, the authors propose ResRouting, which preserves useful gradients while avoiding negative transfer. They also add an automatic route-balancing mechanism that adapts routing temperatures, tied to SAC dynamics, so that unmastered tasks keep exploring routes while the routing of mastered tasks is left undisturbed. On Meta-World, D2R reports state-of-the-art sample efficiency and final performance; further analyses indicate that routing adapts to task difficulty, and ablations support the contribution of each component.

Abstract

Multi-task reinforcement learning endeavors to accomplish a set of different tasks with a single policy. To enhance data efficiency by sharing parameters across multiple tasks, a common practice segments the network into distinct modules and trains a routing network to recombine these modules into task-specific policies. However, existing routing approaches employ a fixed number of modules for all tasks, neglecting that tasks with varying difficulties commonly require varying amounts of knowledge. This work presents a Dynamic Depth Routing (D2R) framework, which learns strategic skipping of certain intermediate modules, thereby flexibly choosing different numbers of modules for each task. Under this framework, we further introduce a ResRouting method to address the issue of disparate routing paths between behavior and target policies during off-policy training. In addition, we design an automatic route-balancing mechanism to encourage continued routing exploration for unmastered tasks without disturbing the routing of mastered ones. We conduct extensive experiments on various robotics manipulation tasks in the Meta-World benchmark, where D2R achieves state-of-the-art performance with significantly improved learning efficiency.
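The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `d2r_forward`, the `tanh` modules, the top-k selection, and the convention that source index 0 is the raw input while index j+1 is the output of module j are all assumptions made for illustration. Each module turns its routing logits into probabilities with a temperature softmax, keeps its most probable sources, and consumes their weighted sum, so earlier modules that nothing selects are effectively skipped.

```python
import numpy as np

def softmax(x, temperature=1.0):
    # Higher temperature flattens the distribution (more routing exploration).
    z = (x - np.max(x)) / temperature
    e = np.exp(z)
    return e / e.sum()

def d2r_forward(x, weights, routing_logits, top_k=2, temperature=1.0):
    """Hypothetical depth-routing forward pass.

    x               : input feature vector of shape (d,)
    weights         : list of n (d, d) matrices, one per module
    routing_logits  : routing_logits[i] has i + 1 entries, one per candidate
                      source of module i (index 0 = input, j + 1 = module j)
    Returns the last module's output and the sources each module selected.
    """
    outputs = [x]          # outputs[0] is the input, outputs[j + 1] is module j
    selected = []
    for i, W in enumerate(weights):
        p = softmax(np.asarray(routing_logits[i], dtype=float), temperature)
        k = min(top_k, len(p))
        chosen = np.argsort(p)[-k:]            # indices of the k largest probabilities
        w = p[chosen] / p[chosen].sum()        # renormalize over the chosen sources
        h = sum(wj * outputs[j] for wj, j in zip(w, chosen))
        outputs.append(np.tanh(W @ h))
        selected.append(sorted(int(j) for j in chosen))
    return outputs[-1], selected

rng = np.random.default_rng(0)
d = 4
weights = [rng.standard_normal((d, d)) for _ in range(3)]
logits = [[0.0], [2.0, -1.0], [5.0, -3.0, 1.0]]
out, selected = d2r_forward(rng.standard_normal(d), weights, logits, top_k=2)
print(selected)
```

In this toy setup the last module reads from the input and from the module just before it, bypassing the first module's output directly; in a learned system the logits would come from a routing network conditioned on the task.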

Paper Structure (41 sections, 16 equations, 12 figures, 8 tables)


Figures (12)

  • Figure 1: The easy task drawer-close can be solved by simply pushing at any position of the drawer, so D2R learns to use only 6 modules. However, when tackling the difficult task drawer-open, which involves first grasping the handle and then pulling outward, D2R adapts to utilize 10 modules.
  • Figure 2: Four different levels of routing with $n$ modules. (a) is the basic multi-head approach. (b) routes from several separate networks. (c) establishes connections between adjacent layers. (d) can build any possible connections to form a DAG routing.
  • Figure 3: The structure of D2R contains a base module network (left) with multiple modules and a routing network (right) that generates the routing probabilities $p^i$ for each module to select and combine its routing sources.
  • Figure 4: (a) and (b) illustrate the disparate routing paths between behavior policy and target policy during off-policy training. (c) shows the structure of ResRouting, where routing sources with low probabilities are processed using the $\operatorname{rsg}$ operator.
  • Figure 5: Comparison of D2R against baselines on four benchmark settings.
  • ...and 7 more figures
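
The Figure 1 caption reports how many modules a task uses (6 for drawer-close, 10 for drawer-open). One way such a count could be derived from a DAG routing is sketched below; it is an illustrative assumption, not the paper's procedure. The hypothetical helper `modules_on_route` takes, for each module, the list of sources it selected (index 0 is the raw input, index j+1 is module j) and traces back from the last module to find the modules that actually contribute to the output.

```python
def modules_on_route(sources):
    """Return the sorted indices of modules that feed the final output.

    sources[i] lists the selected sources of module i, where 0 denotes the
    raw input and j + 1 denotes the output of an earlier module j (j < i).
    Assumes the last module always produces the policy output.
    """
    needed = set()
    stack = [len(sources) - 1]
    while stack:
        i = stack.pop()
        if i in needed:
            continue
        needed.add(i)
        for s in sources[i]:
            if s > 0:              # s == 0 is the input, not a module
                stack.append(s - 1)
    return sorted(needed)

# Module 3 reads the input and module 2; module 2 reads the input and module 0.
# Module 1 is never selected by anything downstream, so it is skipped.
route = [[0], [0], [0, 1], [0, 3]]
print(modules_on_route(route))  # [0, 2, 3]
```

Under this reading, an easier task would correspond to a routing whose traced-back set is smaller, and a harder task to one that keeps more modules on the path.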