
Efficient Multi-Task Reinforcement Learning via Task-Specific Action Correction

Jinyuan Feng, Min Chen, Zhiqiang Pu, Tenghai Qiu, Jianqiang Yi

TL;DR

This work addresses poor generalization and inter-task conflict in multi-task reinforcement learning (MTRL) by introducing Task-Specific Action Correction (TSAC), a two-policy framework that separates a shared policy (SP) from an action correction policy (ACP). SP learns from dense, task-specific rewards, while ACP uses goal-oriented sparse rewards and distance-aware action editing to achieve long-term cross-task generalization; a Lagrangian multiplier balances the two objectives. TSAC improves both sample efficiency and final performance on Meta-World MT10 and MT50, outperforming strong baselines and its own ablations. By letting SP and ACP cooperate and by providing a general mechanism for incorporating sparse goals, TSAC offers a practical route to scalable, generalized MTRL in robotic manipulation.
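The TL;DR describes a two-stage action pipeline in which SP proposes and ACP corrects. The following sketch shows one way such a pipeline could be wired. It is a minimal illustration under stated assumptions, not the paper's implementation. Several details are hypothetical: that ACP receives the state, a one-hot task id, and SP's preliminary action; that ACP's output is *added* to that action; the `LinearPolicy` stand-in for a neural network; and the [-1, 1] action range.

```python
from dataclasses import dataclass
from typing import List


def clip(values: List[float], low: float = -1.0, high: float = 1.0) -> List[float]:
    """Clip each action dimension into the assumed valid range [low, high]."""
    return [max(low, min(high, v)) for v in values]


@dataclass
class LinearPolicy:
    """Hypothetical stand-in for a policy network: returns W @ x + b."""
    weights: List[List[float]]
    bias: List[float]

    def __call__(self, x: List[float]) -> List[float]:
        assert all(len(row) == len(x) for row in self.weights), "input size mismatch"
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weights, self.bias)]


def corrected_action(state: List[float], task_onehot: List[float],
                     shared: LinearPolicy, corrector: LinearPolicy):
    """Two-stage action selection: SP proposes, ACP corrects (additive assumption)."""
    # Stage 1: the shared policy (SP) maps state + task id to a preliminary action.
    preliminary = clip(shared(state + task_onehot))
    # Stage 2: the correction policy (ACP) also sees the preliminary action and
    # returns a task-specific adjustment, which this sketch ASSUMES is added on top.
    correction = corrector(state + task_onehot + preliminary)
    final = clip([p + c for p, c in zip(preliminary, correction)])
    return preliminary, final


# Usage: 2-D state, 2 tasks (one-hot), 2-D action.
sp = LinearPolicy(weights=[[0.5, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0]], bias=[0.0, 0.0])
acp = LinearPolicy(weights=[[0.0] * 6, [0.0] * 6], bias=[0.1, -0.2])
prelim, final = corrected_action([1.0, -1.0], [1.0, 0.0], sp, acp)
print(prelim, final)  # -> [0.5, -0.5] [0.6, -0.7]
```

Under the additive form, an ACP that outputs zero leaves SP's proposal unchanged, so the correction is easy to inspect. A design in which ACP emits the final action directly would be equally plausible, and this summary does not say which form TSAC uses.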

Abstract

Multi-task reinforcement learning (MTRL) shows potential for enhancing the generalization of a robot, enabling it to perform multiple tasks concurrently. However, MTRL performance can still suffer from conflicts between tasks and from negative interference. To enable efficient MTRL, we propose Task-Specific Action Correction (TSAC), a general and complementary approach designed for learning multiple tasks simultaneously. TSAC decomposes policy learning into two separate policies: a shared policy (SP) and an action correction policy (ACP). To alleviate conflicts caused by SP's excessive focus on task-specific details, ACP incorporates goal-oriented sparse rewards, which lets an agent adopt a long-term perspective and generalize across tasks. These additional rewards turn the original problem into a multi-objective MTRL problem. To convert it into a single-objective formulation, TSAC assigns a virtual expected budget to the sparse rewards and employs the Lagrangian method to transform the resulting constrained single-objective optimization into an unconstrained one. Experimental evaluations on Meta-World's MT10 and MT50 benchmarks show that TSAC outperforms existing state-of-the-art methods, with significant improvements in both sample efficiency and effective action execution.
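The abstract's constrained-to-unconstrained step follows the general pattern of Lagrangian relaxation. Suppose the constraint says that the expected sparse return J_s must reach a budget b while the dense return J_d is maximized. This direction of the constraint is an assumption. The Lagrangian is then L(theta, lam) = J_d(theta) + lam * (J_s(theta) - b) with lam >= 0. It is optimized by gradient ascent in theta and projected gradient descent in lam. The toy sketch that follows replaces both returns with hypothetical closed-form functions of a single scalar parameter. It shows the mechanism only and is not TSAC's training loop.

```python
from typing import Tuple


def lagrangian_toy(steps: int = 500, lr_theta: float = 0.1, lr_lambda: float = 0.1,
                   budget: float = 1.0) -> Tuple[float, float]:
    """Gradient descent-ascent on L(theta, lam) = J_d(theta) + lam * (J_s(theta) - budget)."""
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Hypothetical stand-ins for the two returns:
        #   dense  J_d(theta) = -theta**2  (prefers theta = 0)
        #   sparse J_s(theta) = theta      (grows with theta)
        j_sparse = theta
        # dL/dtheta = dJ_d/dtheta + lam * dJ_s/dtheta, using the current lam.
        grad_theta = -2.0 * theta + lam * 1.0
        # Dual step: lam grows while J_s is below budget; projection keeps lam >= 0.
        lam = max(0.0, lam - lr_lambda * (j_sparse - budget))
        # Primal step: gradient ascent on the Lagrangian.
        theta = theta + lr_theta * grad_theta
    return theta, lam


theta, lam = lagrangian_toy()
print(f"theta={theta:.4f}, lambda={lam:.4f}")  # -> theta=1.0000, lambda=2.0000
```

The sparse objective starts below its budget, so lam rises until the constraint is met. It settles at 2, the value at which the dense gradient -2 * theta and the weighted sparse gradient cancel at theta = 1. With a budget the parameter already satisfies (for example `budget=-1.0`), lam stays at 0 and the dense objective alone drives learning.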

Paper Structure

This paper contains 19 sections, 12 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: A variety of related manipulation tasks. Several actions are shared across these tasks, such as approaching objects and interacting with them.
  • Figure 2: The structure of TSAC with two policies: a shared policy (SP) and an action correction policy (ACP).
  • Figure 3: The computation graph of the bi-level transformation equation. Nodes denote variables or networks, and edges denote operations. The orange blocks are negative losses, the blue paths are the gradient paths of $\phi$, and the red paths are the gradient paths of $\psi$.
  • Figure 4: The MT10 benchmark from Meta-World contains 10 tasks, including reach, push, pick-place, and window-open.
  • Figure 5: Training curves of different methods on all benchmarks. The bold lines represent the mean over 4 runs for both the short and long horizons. The shaded area represents the standard error.
  • ...and 1 more figure