Modulating Reservoir Dynamics via Reinforcement Learning for Efficient Robot Skill Synthesis

Zahra Koulaeizadeh, Erhan Oztop

TL;DR

This work proposes a novel RC-based Learning from Demonstration (LfD) framework that not only learns to generate the demonstrated movements but also allows online modulation of the reservoir dynamics to generate movement trajectories that are not covered by the initial demonstration set.

Abstract

A random recurrent neural network, called a reservoir, can be used to learn robot movements conditioned on context inputs that encode task goals. Learning is achieved by mapping the random dynamics of the reservoir, modulated by context, to desired trajectories via linear regression. This makes the reservoir computing (RC) approach computationally efficient, as no iterative gradient descent learning is needed. In this work, we propose a novel RC-based Learning from Demonstration (LfD) framework that not only learns to generate the demonstrated movements but also allows online modulation of the reservoir dynamics to generate movement trajectories that are not covered by the initial demonstration set. This is made possible by using a Reinforcement Learning (RL) module that learns a policy to output context as its actions based on the robot state. Considering that the context dimension is typically low, learning with the RL module is very efficient. We show the validity of the proposed model with systematic experiments on a 2 degrees-of-freedom (DOF) simulated robot that is taught to reach targets, encoded as context, with and without an obstacle avoidance constraint. The initial data set includes a set of reaching demonstrations, which are learned by the reservoir system. To enable reaching out-of-distribution targets, the RL module is engaged in learning a policy to generate dynamic contexts so that the generated trajectory achieves the desired goal without any learning in the reservoir system. Overall, the proposed model uses an initial learned motor primitive set to efficiently generate diverse motor behaviors guided by the designed reward function. Thus, the model can be used as a flexible and effective LfD system where the action repertoire can be extended without new data collection.
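The first stage described in the abstract (fixed random reservoir, context input, linear readout) can be illustrated with a minimal sketch. This is not the authors' implementation: all names (`make_reservoir`, `run_reservoir`, `fit_readout`, `demo_trajectory`, `generate`) are hypothetical, the feedback weights shown in Figure 1 are omitted for brevity, ridge regression stands in for the linear regression step, and the one-dimensional minimum-jerk demonstrations are a toy assumption rather than the paper's reaching data.

```python
import numpy as np

def make_reservoir(n_res=100, n_in=1, n_ctx=1, spectral_radius=0.9, seed=0):
    """Create fixed random weights; none of these are ever trained."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.5, 0.5, (n_res, n_res))
    # Rescale so the largest eigenvalue magnitude equals spectral_radius.
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
    W_ctx = rng.uniform(-0.5, 0.5, (n_res, n_ctx))
    return W, W_in, W_ctx

def run_reservoir(weights, inputs, contexts):
    """Drive the reservoir; inputs is (T, n_in), contexts is (T, n_ctx)."""
    W, W_in, W_ctx = weights
    x = np.zeros(W.shape[0])
    states = np.empty((len(inputs), W.shape[0]))
    for t, (u, c) in enumerate(zip(inputs, contexts)):
        x = np.tanh(W @ x + W_in @ u + W_ctx @ c)
        states[t] = x
    return states

def add_bias(states):
    """Append a constant column so the readout can learn an offset."""
    return np.hstack([states, np.ones((len(states), 1))])

def fit_readout(states, targets, ridge=1e-6):
    """Closed-form ridge regression; no iterative gradient descent."""
    X = add_bias(states)
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ targets)

def demo_trajectory(goal, steps=50):
    """Toy demonstration (assumption): minimum-jerk profile from 0 to goal."""
    s = np.linspace(0.0, 1.0, steps)
    return goal * (10 * s**3 - 15 * s**4 + 6 * s**5)

steps = 50
phase = np.linspace(0.0, 1.0, steps).reshape(-1, 1)  # time signal as input
goals = [0.2, 0.5, 0.8]                              # contexts encode goals
weights = make_reservoir()

all_states, all_targets = [], []
for g in goals:
    ctx = np.full((steps, 1), g)  # constant context per demonstration
    all_states.append(run_reservoir(weights, phase, ctx))
    all_targets.append(demo_trajectory(g, steps).reshape(-1, 1))
states = np.vstack(all_states)
targets = np.vstack(all_targets)

W_out = fit_readout(states, targets)  # the only learned parameters
mse = float(np.mean((add_bias(states) @ W_out - targets) ** 2))

def generate(goal):
    """Generate a trajectory for a constant context with the fixed readout."""
    ctx = np.full((steps, 1), goal)
    return add_bias(run_reservoir(weights, phase, ctx)) @ W_out
```

In this sketch the context is held constant within a trajectory. The second stage described in the abstract would instead let an RL policy choose the context at every step from the robot state, leaving `W_out` and the reservoir weights untouched.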

Paper Structure

This paper contains 23 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Schematic of the DARC (Dynamic Adaptive Reservoir Computing) model. In the model, the Reservoir, as well as the Input, Context and Feedback weights, are fixed and initialized to appropriate random values. In the first stage, the reservoir is trained by using an initial set of tasks (context, movement trajectory pairs), yielding an output weight matrix, which is fixed thereafter. For an out-of-distribution variant of the learned task or a completely novel task, Stage 2 is engaged, where the RL module builds a policy to make online context modulations so that the desired new task goal, captured by the Reward function, can be achieved. After RL is completed, the DARC model is able to perform the initial task as well as the targeted new task with no additional reservoir training.
  • Figure 2: Experimental setup: (a) Target distribution with known, interpolated, and extrapolated points. (b) Simulation of the 2-DOF robotic arm in the Reacher environment, showing trajectory towards the target while avoiding the obstacle.
  • Figure 3: Reaching task test results over two training sessions for each model. (Left) Final distance to target (mean ± SEM), where DARC outperforms CESN and PPO with 77/128 successful reaches, followed by PPO with 15/128, and CESN with only 2/128. (Right) Path length to target shows that DARC maintains relatively shorter paths compared to CESN and PPO.
  • Figure 4: Comparison of 10 test trajectories for the Reaching task. (a) Ground truth from PD controller. (b) CESN model. (c) DARC model. (d) PPO model.
  • Figure 5: Test results over four training sessions for each model for the Reaching with Obstacle Avoidance task. (Left) Final distance to target (mean ± SEM) shows that DARC achieves the most accurate positioning with all targets reached (256/256), followed by PPO, which successfully reaches 171 out of 256 targets. CESN has the largest final distance error, with no targets reached (0/256). (Right) Path length to target reveals that DARC maintains efficient trajectories, while PPO exhibits the longest paths.
  • ...and 2 more figures