Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing
Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng
TL;DR
This work tackles varying task difficulty in multi-task reinforcement learning by introducing Dynamic Depth Routing (D2R), which learns task-specific routing that skips intermediate modules so that each task uses as many or as few modules as it needs. The framework combines a base modular network with a routing network to form a differentiable directed acyclic graph (DAG) per task, enabling flexible depth and knowledge sharing. Because behavior and target policies can follow different routing paths during off-policy training, the authors propose ResRouting, which preserves useful gradients while avoiding negative transfer. They also add an automatic route-balancing mechanism that adjusts routing exploration versus exploitation for each task via adaptive routing temperatures tied to Soft Actor-Critic (SAC) dynamics. Experiments on Meta-World show state-of-the-art sample efficiency and final performance; further analyses indicate that routing adapts to task difficulty, and ablations support the contribution of each component.
Abstract
Multi-task reinforcement learning endeavors to accomplish a set of different tasks with a single policy. To enhance data efficiency by sharing parameters across multiple tasks, a common practice segments the network into distinct modules and trains a routing network to recombine these modules into task-specific policies. However, existing routing approaches employ a fixed number of modules for all tasks, neglecting that tasks with varying difficulties commonly require varying amounts of knowledge. This work presents a Dynamic Depth Routing (D2R) framework, which learns strategic skipping of certain intermediate modules, thereby flexibly choosing different numbers of modules for each task. Under this framework, we further introduce a ResRouting method to address the issue of disparate routing paths between behavior and target policies during off-policy training. In addition, we design an automatic route-balancing mechanism to encourage continued routing exploration for unmastered tasks without disturbing the routing of mastered ones. We conduct extensive experiments on various robotics manipulation tasks in the Meta-World benchmark, where D2R achieves state-of-the-art performance with significantly improved learning efficiency.
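To make the idea of skipping modules concrete, the following is a minimal NumPy sketch of depth routing over modules arranged in a DAG. It is an illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `routed_forward`), the top-k source selection, the `tanh` modules, and the hand-written routing logits are all hypothetical. In the paper the routing is learned by a routing network, which this sketch does not model.

```python
import numpy as np

def softmax(x, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives flatter probabilities."""
    z = (x - x.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def routed_forward(x, weights, route_logits, top_k=2, temperature=1.0):
    """Forward pass through modules ordered as a DAG (illustrative assumption).

    weights[i]      : (d, d) matrix for module i+1.
    route_logits[i] : logits over the sources available to module i+1,
                      i.e. node 0 (the input) and modules 1..i.
    Returns the last module's output and the sorted list of modules that
    actually contribute to it; all other modules are effectively skipped.
    """
    outputs = [x]   # node 0 is the input representation
    sources = []    # sources[i] = nodes feeding module i+1
    for i, W in enumerate(weights):
        logits = np.asarray(route_logits[i], dtype=float)
        k = min(top_k, len(logits))
        keep = np.argsort(logits)[-k:]           # indices of the k largest logits
        probs = softmax(logits[keep], temperature)
        h = sum(p * outputs[j] for p, j in zip(probs, keep))
        outputs.append(np.tanh(W @ h))
        sources.append(set(keep.tolist()))

    # Trace back from the last module to find which modules were really used.
    used, frontier = set(), [len(weights)]
    while frontier:
        node = frontier.pop()
        if node == 0 or node in used:
            continue
        used.add(node)
        frontier.extend(sources[node - 1])
    return outputs[-1], sorted(used)

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)) for _ in range(4)]
x = rng.normal(size=3)

# Hand-written logits standing in for a learned routing network (assumption).
hard_task = [[0.0], [0.0, 1.0], [5.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
easy_task = [[0.0], [0.0, 1.0], [5.0, 0.0, 0.0], [9.0, 0.0, 0.0, 0.0]]

out_hard, used_hard = routed_forward(x, weights, hard_task, top_k=1)
out_easy, used_easy = routed_forward(x, weights, easy_task, top_k=1)
print(used_hard)  # [1, 2, 4]
print(used_easy)  # [4]
```

In this toy setup the "hard" task routes through three modules while the "easy" task connects the last module directly to the input, so the two tasks share parameters but differ in effective depth. The `temperature` argument only hints at the route-balancing idea: raising it flattens the routing probabilities, which corresponds to more exploratory routing.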
