Average-Reward Maximum Entropy Reinforcement Learning for Underactuated Double Pendulum Tasks

Jean Seong Bjorn Choe, Bumkyu Choi, Jong-kook Kim

TL;DR

The paper tackles swing-up and upright stabilization of underactuated double pendulums (acrobot and pendubot) in a continuing-task setting. It applies AR-EAPO, a model-free algorithm that combines average-reward reinforcement learning with maximum-entropy exploration: reward and entropy objectives are decoupled within a soft bias framework, and the policy is optimized with a PPO-style clipped objective. In simulation, AR-EAPO achieves higher performance and robustness scores than baselines such as TVLQR and ILQR Riccati on both tasks, with notable gains in swing-up efficiency and perturbation tolerance. The approach needs less reward engineering and shows potential for real-world robotic deployment, but it has not yet been validated on physical hardware.

Abstract

This report presents a solution for the swing-up and stabilisation tasks of the acrobot and the pendubot, developed for the AI Olympics competition at IROS 2024. Our approach employs the Average-Reward Entropy Advantage Policy Optimization (AR-EAPO), a model-free reinforcement learning (RL) algorithm that combines average-reward RL and maximum entropy RL. Results demonstrate that our controller achieves improved performance and robustness scores compared to established baseline methods in both the acrobot and pendubot scenarios, without the need for a heavily engineered reward function or system model. The current results are applicable exclusively to the simulation stage setup.
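
To make the combination of average-reward RL, an entropy term, and a PPO-style clipped objective more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a generalized advantage estimator without discounting in which an estimate of the average reward (`rho`) is subtracted from each reward, a hypothetical temperature `tau` that weights a separately estimated entropy advantage, and the standard PPO clipping rule. All function and parameter names here are illustrative assumptions.

```python
import numpy as np


def differential_gae(rewards, values, rho, lam=0.95):
    """Undiscounted GAE for the average-reward setting (illustrative sketch).

    rewards: per-step rewards, length T.
    values:  bias (differential value) estimates, length T + 1,
             where the last entry is the bootstrap value.
    rho:     assumed estimate of the average reward per step.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    # Differential TD error: the average reward replaces discounting.
    deltas = rewards - rho + values[1:] - values[:-1]
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + lam * running
        adv[t] = running
    return adv


def combine_advantages(reward_adv, entropy_adv, tau=0.1):
    """Weighted sum of reward and entropy advantages (tau is hypothetical)."""
    return np.asarray(reward_adv, dtype=float) + tau * np.asarray(entropy_adv, dtype=float)


def clipped_surrogate(ratio, adv, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized)."""
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))
```

In this sketch, the entropy advantage would be computed by calling `differential_gae` a second time with `-log pi(a|s)` in place of the reward and its own value and average estimates; that pairing is an assumption about how the two objectives could be kept separate, not a detail confirmed by the abstract.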

Paper Structure (12 sections, 12 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Swing-up trajectory of the acrobot without noise.
  • Figure 2: Swing-up trajectory of the pendubot without noise.
  • Figure 3: Swing-up trajectory of the acrobot with noise.
  • Figure 4: Swing-up trajectory of the pendubot with noise.
  • Figure 5: Robustness results of our controllers for the acrobot and the pendubot.