Barrier Functions Inspired Reward Shaping for Reinforcement Learning

Nilaksh Nilaksh, Abhishek Ranjan, Shreenabh Agrawal, Aayush Jain, Pushpak Jagtap, Shishir Kolathaya

TL;DR

This paper presents a novel safety-oriented reward-shaping framework inspired by barrier functions that is simple to implement across a range of environments and tasks. In a sim-to-real experiment on the Unitree Go1 quadruped, the authors report better control and dynamics of the robot with their reward framework.

Abstract

Reinforcement Learning (RL) has progressed from simple control tasks to complex real-world challenges with large state spaces. While RL excels in these tasks, training time remains a limitation. Reward shaping is a popular solution, but existing methods often rely on value functions, which face scalability issues. This paper presents a novel safety-oriented reward-shaping framework inspired by barrier functions, offering simplicity and ease of implementation across various environments and tasks. To evaluate the effectiveness of the proposed reward formulations, we conduct simulation experiments on CartPole, Ant, and Humanoid environments, along with real-world deployment on the Unitree Go1 quadruped robot. Our results demonstrate that our method leads to 1.4-2.8 times faster convergence and actuation effort as low as 50-60% of that required with the vanilla reward. In a sim-to-real experiment with the Go1 robot, we demonstrated better control and dynamics of the robot with our reward framework.

Paper Structure (16 sections, 11 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: An example run of Humanoid with the vanilla reward (top) and the exponential barrier reward (bottom), trained for the same number of time steps. Unlike the vanilla reward, the barrier-function-based reward leads to more natural and less wasteful movements.
  • Figure 2: Plots illustrating the proposed barrier functions \ref{eq:quad-cbf}-\ref{eq:exp-cbf}. Dashed lines represent the constraint limits $(-1, 1)$.
  • Figure 3: Density plot of the quadratic BF reward $r^\text{BF}_\text{quad}$, constructed using \ref{eq:shaping-term1} and \ref{eq:quad-cbf} with $(-1, 1)$ as the bounds on $s$. Notice that the reward depends on both $s$ and $\dot s$. The solid and dotted contour lines correspond to positive and negative values of the reward, respectively.
  • Figure 4: Energy expended to stabilize the pole from each initial angle, for each reward formulation in the CartPole environment.
  • Figure 5: Pole angle $\theta_p$ and x-position versus time step, showing the control response when stabilizing an initial pole angle of $-40^\circ$ in the CartPole environment. The plot shows that the vanilla policy struggles to bring the pole angle close to zero, while the quadratic policy accomplishes this within a few time steps.
  • ...and 3 more figures
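
The Figure 3 caption describes a shaping reward that depends on both $s$ and $\dot s$ and is built from a quadratic barrier function with bounds $(-1, 1)$. The equations themselves are not reproduced on this page, so the following is only a minimal sketch under stated assumptions: the barrier form $h(s) = (s_\text{max} - s)(s - s_\text{min})$, the shaping term $\dot h + \gamma h$, and the names `quadratic_barrier`, `bf_shaping_reward`, and `gamma` are all hypothetical and may differ from the paper's definitions.

```python
def quadratic_barrier(s, s_min=-1.0, s_max=1.0):
    """Assumed quadratic barrier: positive inside (s_min, s_max),
    zero on the bounds, negative outside. Not taken from the paper."""
    return (s_max - s) * (s - s_min)


def quadratic_barrier_grad(s, s_min=-1.0, s_max=1.0):
    """Derivative of the assumed barrier with respect to s."""
    return s_max + s_min - 2.0 * s


def bf_shaping_reward(s, s_dot, gamma=1.0, s_min=-1.0, s_max=1.0):
    """Assumed shaping term dh/dt + gamma * h, using the chain rule
    dh/dt = h'(s) * s_dot. gamma is a hypothetical tuning gain."""
    h = quadratic_barrier(s, s_min, s_max)
    h_dot = quadratic_barrier_grad(s, s_min, s_max) * s_dot
    return h_dot + gamma * h


if __name__ == "__main__":
    # At rest in the middle of the safe interval the term is positive.
    print(bf_shaping_reward(0.0, 0.0))
    # On the upper bound and still moving outward the term is negative.
    print(bf_shaping_reward(1.0, 1.0))
    # Near the upper bound but moving back toward the centre it is positive.
    print(bf_shaping_reward(0.9, -1.0))
```

Under these assumptions the term rewards states well inside the bounds and velocities that point back toward the interior, and penalizes motion toward or beyond a bound, which matches the caption's description of positive and negative reward regions in the $(s, \dot s)$ plane.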

Theorems & Definitions (2)

  • Definition III.1
  • Remark IV.1