Multi-Task Reinforcement Learning in Continuous Control with Successor Feature-Based Concurrent Composition

Yu Tang Liu; Aamir Ahmad

TL;DR

This paper addresses the sample inefficiency of online learning in continuous robotic control by introducing a unified online concurrent composition framework that merges successor feature-based generalized policy improvement (SF-GPI) with value composition (VC). It develops two SF-based composition rules, SFV and MSF, and extends composition to the action space via the Multiplicative Compositional Policy (MCP), leading to Direct Action Composition (DAC) and DAC-GPI, which use an impact matrix to map features to actions. Primitives are trained per sub-task and combined online to form policies for new tasks, and theoretical links show how composition in value space induces a corresponding composition in policy space. Experiments on the IsaacGym-based Pointmass2D and Pointer benchmarks show single-task performance on par with SAC and effective transfer to unseen tasks, reveal trade-offs caused by composition loss and noise, and indicate that real-time compositional RL is practically feasible in multi-task robotics. The work provides open-source code and points toward task-agnostic autotelic agents and curriculum-style learning in continuous control.
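As a rough illustration of the SF-GPI idea mentioned above, the sketch below assumes that each primitive i provides a successor feature vector psi_i(s, a) for a small set of candidate actions, and that a task is described by a weight vector w, so that Q_i(s, a) = psi_i(s, a) · w. The function names, the list-based data layout, and the use of a finite candidate set (for example, one action sampled from each primitive) are assumptions made for illustration, not the paper's implementation.

```python
from typing import Sequence


def q_value(psi: Sequence[float], w: Sequence[float]) -> float:
    """Q(s, a) as the inner product of successor features and task weights."""
    return sum(p * wi for p, wi in zip(psi, w))


def gpi_action(psi_table: Sequence[Sequence[Sequence[float]]],
               w: Sequence[float]) -> int:
    """Pick the candidate action with the highest value under any primitive.

    psi_table[i][a] is the (hypothetical) successor feature vector of
    primitive i for candidate action a in the current state.
    Returns argmax_a max_i psi_i(s, a) . w as an index into the candidates.
    """
    n_actions = len(psi_table[0])
    best_action, best_q = 0, float("-inf")
    for a in range(n_actions):
        # Generalized policy improvement: take the best primitive per action.
        q = max(q_value(psi_i[a], w) for psi_i in psi_table)
        if q > best_q:
            best_action, best_q = a, q
    return best_action


# Two primitives, two candidate actions, two-dimensional features.
psi_table = [
    [[1.0, 0.0], [0.0, 1.0]],  # primitive 0
    [[0.5, 0.5], [0.0, 2.0]],  # primitive 1
]
action_for_task_a = gpi_action(psi_table, [1.0, 0.0])
action_for_task_b = gpi_action(psi_table, [0.0, 1.0])
```

Changing only the task weights w switches the selected action without retraining any primitive, which is the property that makes this kind of composition attractive for transfer.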

Abstract

Deep reinforcement learning (DRL) frameworks are increasingly used to solve high-dimensional continuous control tasks in robotics. However, because of their poor sample efficiency, applying DRL to online learning is still practically infeasible in the robotics domain. One reason is that DRL agents do not leverage the solutions of previous tasks when facing new tasks. Recent work on multi-task DRL agents based on successor features (SFs) has shown promise for improving sample efficiency. In this work, we present a new approach that unifies two prior multi-task RL frameworks, SF-GPI and value composition, and adapts them to the continuous control domain. We exploit the compositional properties of successor features to compose a policy distribution from a set of primitives without training any new policy. Lastly, to demonstrate the multi-tasking mechanism, we present our proof-of-concept benchmark environments, Pointmass and Pointer, based on IsaacGym, which facilitates large-scale parallelization to accelerate the experiments. Our experimental results show that our multi-task agent has single-task performance on par with soft actor-critic (SAC) and that the agent can successfully transfer to new, unseen tasks. We provide our code as open-source at "https://github.com/robot-perception-group/concurrent_composition" for the benefit of the community.
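To make "composing a policy distribution from a set of primitives" more concrete, here is a minimal sketch of a multiplicative composition of diagonal Gaussian primitives, in the spirit of MCP. It relies on the standard fact that a weighted product of Gaussians is proportional to a Gaussian whose precision is the weighted sum of the primitive precisions. The function name, the argument layout, and the way the weights are supplied are assumptions for illustration; how the paper derives the weights from successor features is not shown here.

```python
from typing import List, Sequence, Tuple


def compose_gaussians(means: Sequence[Sequence[float]],
                      variances: Sequence[Sequence[float]],
                      weights: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Compose diagonal Gaussian primitives multiplicatively.

    For each action dimension j:
        1 / var_j = sum_i w_i / var_ij
        mean_j    = var_j * sum_i w_i * mean_ij / var_ij
    which is the Gaussian proportional to prod_i N(mean_i, var_i) ** w_i.
    """
    dim = len(means[0])
    out_mean, out_var = [], []
    for j in range(dim):
        precision = sum(w / v[j] for w, v in zip(weights, variances))
        if precision <= 0.0:
            raise ValueError("weights must give a positive total precision")
        var = 1.0 / precision
        mean = var * sum(w * m[j] / v[j]
                         for w, m, v in zip(weights, means, variances))
        out_mean.append(mean)
        out_var.append(var)
    return out_mean, out_var


# Two one-dimensional primitives with equal variance and equal weight.
mean, var = compose_gaussians(
    means=[[0.0], [2.0]],
    variances=[[1.0], [1.0]],
    weights=[1.0, 1.0],
)
```

With equal weights the composite mean lies between the primitive means and the composite is narrower than either primitive; setting a weight to zero removes that primitive from the product.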

Paper Structure (32 sections, 26 equations, 5 figures, 3 tables, 5 algorithms)

Introduction
Related work
Background
Multi-Task Reinforcement Learning
Maximum Entropy Reinforcement Learning
Soft Policy Iteration
Successor Feature (SF)
Concurrent Composition
Generalized Policy Improvement Composition (GPI)
Value Composition (VC)
Multiplicative Compositional Policy (MCP)
Methodology
Successor Feature-based Composition
Successor Feature-based Value Composition (SFV)
Maximum Successor Feature Composition (MSF)
...and 17 more sections

Figures (5)

Figure 1: In concurrent composition, policy extraction is intractable online. Instead, we propose composing the primitives directly at run time.
Figure 2: Network architecture.
Figure 3: Empirical results on Pointmass (top row) and Pointer (bottom row) show that the proposed composition agents gradually improve transfer performance on unseen tasks and have single-task performance comparable to the SAC baseline (Haarnoja et al., 2018). SAC is shown on the transfer tasks to demonstrate that the tasks are not generalizable. Each curve shows the mean and two standard deviations over 5 runs of the best model found by hyper-parameter tuning. Tuning is conducted by Bayesian optimization with one hundred searches for each agent. Our ablation study indicates that dropout can hurt performance, while the effects of the other techniques (layer norm, prioritized experience replay, entropy tuning, and the activation function) remain unclear.
Figure 4: Visualization of primitive and composite distributions via sample trajectories from the initial position $X_0$ (red) to the goal at the origin. Composition agents solve new tasks by composing existing primitives.
Figure 5: In Pointmass2D-Simple, DAC effectively removes compositional noise (Fig.(a) top right and bottom left) by reducing the corresponding $\kappa$. Each sub-figure represents a 2D plane within $[-20, 20]$ meters.
