Combining RL and IL using a dynamic, performance-based modulation over learning signals and its application to local planning

Francisco Leiva, Javier Ruiz-del-Solar

TL;DR

This work addresses the inefficiency of pure RL in robotic local planning by blending RL with imitation-based signals through a dynamic, performance-based modulation (PModL). A performance estimate $z \in [0,1]$ and a gradient-based balancing factor $\lambda$ arbitrate the influence of $J_{RL}$ and $J_{IL}$ during training, enabling a smooth shift from IL/IIL to RL. The authors instantiate this framework with DDPG plus BC or COACH signals and validate it in simulation and on a real skid-steered robot, reporting up to roughly 4x improvements in sample efficiency and a higher average success rate (0.959) than pure RL or IL baselines. The results also show good sim-to-real transfer, suggesting the method's broad applicability to problems where online IL/IIL signals can be synthesized, and highlight the method’s robustness to reward design and task variations.

Abstract

This paper proposes a method to combine reinforcement learning (RL) and imitation learning (IL) using a dynamic, performance-based modulation over learning signals. The proposed method combines RL and behavioral cloning (IL), or corrective feedback in the action space (interactive IL/IIL), by dynamically weighting the losses to be optimized, taking into account the backpropagated gradients used to update the policy and the agent's estimated performance. In this manner, RL and IL/IIL losses are combined by equalizing their impact on the policy's updates, while modulating said impact such that IL signals are prioritized at the beginning of the learning process, and as the agent's performance improves, the RL signals become progressively more relevant, allowing for a smooth transition from pure IL/IIL to pure RL. The proposed method is used to learn local planning policies for mobile robots, synthesizing IL/IIL signals online by means of a scripted policy. An extensive evaluation of the application of the proposed method to this task is performed in simulations, and it is empirically shown that it outperforms pure RL in terms of sample efficiency (achieving the same level of performance in the training environment utilizing approximately 4 times fewer experiences), while consistently producing local planning policies with better performance metrics (achieving an average success rate of 0.959 in an evaluation environment, outperforming pure RL by 12.5% and pure IL by 13.9%). Furthermore, the obtained local planning policies are successfully deployed in the real world without performing any major fine-tuning. The proposed method can extend existing RL algorithms, and is applicable to other problems for which generating IL/IIL signals online is feasible. A video summarizing some of the real-world experiments that were conducted can be found at https://youtu.be/mZlaXn9WGzw.
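
The weighting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' formulation: the exact expressions for the balancing factor and the modulation are not given in this summary, so the code assumes that the balancing factor is the ratio of the RL and IL gradient norms and that the performance estimate $z \in [0,1]$ interpolates linearly between the two signals. The function names `grad_norm`, `balancing_factor`, and `modulated_update` are hypothetical.

```python
import numpy as np


def grad_norm(grads):
    """L2 norm of a list of gradient arrays, treated as one flat vector."""
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))


def balancing_factor(grads_rl, grads_il, eps=1e-8):
    """Assumed lambda: rescales the IL gradient to the RL gradient's norm."""
    return grad_norm(grads_rl) / (grad_norm(grads_il) + eps)


def modulated_update(grads_rl, grads_il, z):
    """Blend policy gradients given a performance estimate z in [0, 1].

    Assumption: z = 0 yields a pure IL/IIL update and z = 1 a pure RL
    update, with a linear interpolation in between.
    """
    z = float(np.clip(z, 0.0, 1.0))
    lam = balancing_factor(grads_rl, grads_il)
    return [z * g_rl + (1.0 - z) * lam * g_il
            for g_rl, g_il in zip(grads_rl, grads_il)]


# Toy gradients: the RL gradient has norm 5, the IL gradient has norm 0.5.
g_rl = [np.array([3.0, 4.0])]
g_il = [np.array([0.0, 0.5])]
early = modulated_update(g_rl, g_il, z=0.0)  # low performance: IL only
late = modulated_update(g_rl, g_il, z=1.0)   # high performance: RL only
```

Under these assumptions, the IL-only update is rescaled to the same norm as the RL gradient, so neither signal dominates merely because of its loss scale, and raising `z` shifts the update smoothly toward RL.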

Paper Structure (35 sections, 14 equations, 11 figures, 3 tables, 1 algorithm)

Figures (11)

  • Figure 1: Diagram of the problem addressed in this work. The velocity commands and odometry-based estimations are referenced to the robot's local frame. The linear velocity of the robot, $v_x$, goes along the $X$-axis, and its angular velocity, $v_\theta$, around the $Z$-axis. The navigation target is defined in polar coordinates as $(\rho_\text{target}, \theta_\text{target})$, also with respect to the robot's local frame. The 2D range measurements are represented as yellow dots.
  • Figure 2: Diagram of the actor and critic architectures. The layers are described by the operation they perform or by a "Type(Parameters) Activation Function" notation. "Fc" stands for fully connected, and its parameter corresponds to the number of units it has. "MLP" stands for multilayer perceptron, and each of its parameters corresponds to the number of units of one of the fully connected layers that make up the MLP. Finally, "LReLU" stands for Leaky ReLU and "Tanh" for hyperbolic tangent.
  • Figure 3: Top view illustration of the Husky A200™, its dimensions, local frame, and the field of view of the 2D LiDARs installed on the robot.
  • Figure 4: Displacement histograms obtained for the positions estimated at $t+k$ and $t$, taking the pose at $t+k$ as the reference coordinate system.
  • Figure 5: Environments utilized to train, evaluate and validate the performance of agents in simulations. Both environments are 16 m long by 16 m wide.
  • ...and 6 more figures
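
The target representation in the Figure 1 caption, polar coordinates $(\rho_\text{target}, \theta_\text{target})$ in the robot's local frame, can be sketched as follows. This is an illustrative example only: it assumes a planar world frame in which the robot pose is given as `(x, y, yaw)`, and the function name `target_in_robot_frame` is hypothetical rather than taken from the paper.

```python
import math


def target_in_robot_frame(robot_x, robot_y, robot_yaw, target_x, target_y):
    """Return (rho, theta) of a world-frame target in the robot's local frame.

    rho is the Euclidean distance to the target; theta is the bearing
    relative to the robot's X-axis, wrapped to [-pi, pi].
    """
    dx = target_x - robot_x
    dy = target_y - robot_y
    rho = math.hypot(dx, dy)
    bearing = math.atan2(dy, dx) - robot_yaw
    # Wrap the angle so that small heading errors stay small.
    theta = math.atan2(math.sin(bearing), math.cos(bearing))
    return rho, theta


# Robot at the origin facing +Y, target 2 m straight ahead.
rho, theta = target_in_robot_frame(0.0, 0.0, math.pi / 2, 0.0, 2.0)
```

Expressing the target relative to the robot keeps the policy input independent of where the robot is in the world, which matches the local-frame convention stated in the caption.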