Foundational Policy Acquisition via Multitask Learning for Motor Skill Generation

Satoshi Yamamori, Jun Morimoto

TL;DR

The paper tackles rapid motor-skill generation under implicitly changing tasks by introducing a three-phase multitask reinforcement learning framework that learns a foundational policy via encoder-based context representation. It formalizes contextual MDPs through a variational RL lens, linking entropy regularization to KL-divergence minimization and leveraging a dedicated three-stage workflow: foundational policy acquisition, policy selection, and skill generation, with latent variable optimization via derivative-free methods. Empirical results show superior performance against established meta-RL baselines on standard multi-locomotion tasks and successful novel skill generation on a monopod heading task, including an overhead kicking capability not explicitly trained. The work demonstrates how latent context embedding and policy selection enable efficient adaptation to unseen tasks and environments, offering a path toward scalable, transferable motor skills in robotics, with potential extensions to multi-agent scenarios.
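As a rough illustration of the skill generation phase, the sketch below searches over a latent variable $z$ without using gradients while the policy stays frozen. It uses the cross-entropy method purely as a simple stand-in for a derivative-free optimizer; it is not the authors' procedure (the Figure 1 caption names Bayesian optimization). The function `episode_return`, the target `z_star`, and all hyperparameters are hypothetical placeholders: in practice the objective would be the return from rolling out the selected policy on the novel task.

```python
import numpy as np

def episode_return(z, z_star):
    """Hypothetical stand-in for rolling out a frozen policy pi(a | s, z)
    on a novel task and summing rewards. Here: a toy quadratic peaked at z_star."""
    return -float(np.sum((z - z_star) ** 2))

def cem_latent_search(objective, dim, iters=30, pop=64, elite_frac=0.25, seed=0):
    """Cross-entropy method: a derivative-free search over the latent z."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        # Sample candidate latents around the current search distribution.
        candidates = rng.normal(mean, std, size=(pop, dim))
        scores = np.array([objective(z) for z in candidates])
        # Keep the best-scoring candidates and refit the distribution to them.
        elite = candidates[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-3  # small floor so the search does not collapse
    return mean

z_star = np.array([0.5, -1.0])  # assumed "best" latent for the toy task
z_hat = cem_latent_search(lambda z: episode_return(z, z_star), dim=2)
```

Only the objective is queried, never its gradient, which is what makes this family of methods usable when the return is obtained from simulation rollouts.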

Abstract

In this study, we propose a multitask reinforcement learning algorithm for foundational policy acquisition to generate novel motor skills. Learning a rich representation of the multitask policy is a challenge in dynamic movement generation tasks because the policy needs to cope with changes in goals or environments with different reward functions or physical parameters. Inspired by human sensorimotor adaptation mechanisms, we developed a learning pipeline that constructs encoder-decoder networks and performs network selection to facilitate foundational policy acquisition under multiple situations. First, we compared the proposed method with previous multitask reinforcement learning methods on standard multi-locomotion tasks. The results showed that the proposed approach outperformed the baseline methods. Then, we applied the proposed method to a ball heading task using a monopod robot model to evaluate skill generation performance. The results showed that the proposed method was able not only to adapt to novel target positions or unseen ball restitution coefficients but also to acquire a foundational policy network, originally learned for heading motion, which can generate an entirely new overhead kicking skill.

Paper Structure

This paper contains 30 sections, 25 equations, 15 figures, 3 tables, 3 algorithms.

Figures (15)

  • Figure 1: Proposed three-phase multitask learning method. The proposed pipeline comprises three phases: acquisition, selection, and generation. (a) In the foundational policy acquisition phase, multiple candidate policies are generated; the $i$-th candidate policy $\pi_i$ is trained under multiple tasks. (b) The policy selection phase chooses a policy from the policy set $\Pi$ based on the performance index defined in Eq. (\ref{eq:policy-selection-index}). (c) In the skill generation phase, the context-related latent variable $z$ is estimated to generate skills that handle novel tasks, namely unknown rewards and environmental settings. We showed that the Bayesian optimization method can be used efficiently to update the latent variable during skill generation.
  • Figure 2: Multitask environments to generate a variety of dynamic movements. (a) Half-Cheetah-Dir domain. The goal is to move forward or backward as fast as possible. (b) Half-Cheetah-Vel domain. The goal is to reach a target velocity (Finn et al., 2017; Rakelly et al., 2019). (c, d, e) Implicit multitask heading. The start position, the goal position, and the coefficient of restitution are varied.
  • Figure 3: Foundational policy acquisition and policy selection. We demonstrated that selecting a foundational policy model according to the learning performance with randomized task parameters is the key to successful adaptation to multiple reward and environmental settings. Furthermore, our proposed method simultaneously acquires the optimal policy for each context and encoder to embed context variables that correspond to the task settings.
  • Figure 4: Network architecture. Our proposed method uses three neural networks: a policy network, an encoder network, and a $Q$-network that approximates the action-value function. All networks contain two hidden layers, and the numbers on each block represent the layer size. The policy network takes state $s$ and latent variable $z$ as inputs. The encoder network takes the context variable $c$ as input. The $Q$-network receives state $s$, action $a$, and context $c$ as inputs.
  • Figure 5: Half-Cheetah-Dir
  • ...and 10 more figures
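
The input and output wiring described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every dimension, the hidden width, the ReLU activation, and the weight initialization are placeholders, since the caption gives the layer sizes only in the figure itself. The encoder and policy are also written as deterministic maps here for brevity, which is a simplification.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Weights and biases for a fully connected network with the given layer sizes."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """ReLU on hidden layers (an assumed activation), linear output layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

# All dimensions below are placeholders, not values from the paper.
STATE_DIM, ACTION_DIM, CONTEXT_DIM, LATENT_DIM, HIDDEN = 8, 3, 2, 4, 64

rng = np.random.default_rng(0)
# Two hidden layers per network, as stated in the caption.
encoder = init_mlp([CONTEXT_DIM, HIDDEN, HIDDEN, LATENT_DIM], rng)
policy = init_mlp([STATE_DIM + LATENT_DIM, HIDDEN, HIDDEN, ACTION_DIM], rng)
q_net = init_mlp([STATE_DIM + ACTION_DIM + CONTEXT_DIM, HIDDEN, HIDDEN, 1], rng)

s = rng.normal(size=STATE_DIM)    # state
c = rng.normal(size=CONTEXT_DIM)  # context variable (task setting)

z = forward(encoder, c)                        # encoder: c -> z
a = forward(policy, np.concatenate([s, z]))    # policy: (s, z) -> a
q = forward(q_net, np.concatenate([s, a, c]))  # Q-network: (s, a, c) -> scalar value
```

Note that the policy sees the task only through $z$, while the $Q$-network is given the context $c$ directly; during skill generation, $z$ can therefore be treated as a free variable to optimize without touching the policy weights.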