Dynamic Entropy Tuning in Reinforcement Learning Low-Level Quadcopter Control: Stochasticity vs Determinism

Youssef Mahran, Zeyad Gamal, Ayman El-Badawy

TL;DR

This study investigates dynamic entropy tuning in reinforcement learning for low-level quadcopter control by comparing a stochastic SAC policy against a deterministic TD3 policy. By dynamically adjusting the entropy coefficient toward a target entropy, SAC aims to sustain exploration and prevent premature convergence. Across small and large simulated environments, SAC with dynamic entropy outperformed TD3, offering faster learning, greater stability, and better generalization, while static entropy or external-noise strategies led to suboptimal performance or catastrophic forgetting. The findings suggest dynamic entropy tuning enhances robustness and adaptability of quadcopter control under varying conditions, with practical implications for real-time, low-level motor command policies.
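The idea of adjusting the entropy coefficient toward a target entropy can be illustrated with the temperature update commonly used in SAC implementations. The sketch below is not the authors' code. It assumes the coefficient is parameterized as `log_alpha`, that the loss is `-log_alpha * mean(log_prob + target_entropy)`, and that the target entropy follows the common heuristic of minus the action dimension (taken here to be 4 motor commands, which is an assumption). The function name `update_log_alpha` and the learning rate are hypothetical.

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on the assumed temperature loss
    L(log_alpha) = -log_alpha * mean(log_prob + target_entropy)."""
    mean_gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -mean_gap  # dL / d(log_alpha)
    return log_alpha - lr * grad


# Assumption: target entropy = -dim(action), with 4 motor commands.
target_entropy = -4.0
log_alpha = 0.0  # alpha = exp(0) = 1

# Policy entropy estimate is -mean(log_prob).
# Entropy of -6 is below the target of -4, so alpha should rise.
raised = update_log_alpha(log_alpha, [6.0, 6.0, 6.0], target_entropy)

# Entropy of 2 is above the target of -4, so alpha should fall.
lowered = update_log_alpha(log_alpha, [-2.0, -2.0, -2.0], target_entropy)

print(math.exp(raised), math.exp(lowered))
```

Under these assumptions the coefficient grows when the policy becomes less random than the target and shrinks when it is more random, which is the mechanism the summary credits with sustaining exploration.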

Abstract

This paper explores the impact of dynamic entropy tuning in Reinforcement Learning (RL) algorithms that train a stochastic policy, and compares their performance against algorithms that train a deterministic one. Stochastic policies optimize a probability distribution over actions to maximize rewards, while deterministic policies select a single action per state. The paper examines the effect of training a stochastic policy with both static and dynamic entropy and then executing deterministic actions to control the quadcopter, and compares this against training a deterministic policy and executing deterministic actions. The Soft Actor-Critic (SAC) algorithm was chosen as the stochastic algorithm, while Twin Delayed Deep Deterministic Policy Gradient (TD3) was chosen as the deterministic algorithm. The training and simulation results show that dynamic entropy tuning improves quadcopter control by preventing catastrophic forgetting and improving exploration efficiency.
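The distinction between training a stochastic policy and executing deterministic actions can be made concrete with a small sketch. It assumes a tanh-squashed Gaussian policy, which is the usual choice in SAC implementations but is not confirmed by this summary. The function names and the example mean and standard deviation are hypothetical.

```python
import math
import random


def sample_action(mean, std, rng):
    """Training-time action: sample from the Gaussian, then squash to [-1, 1]."""
    return [math.tanh(m + s * rng.gauss(0.0, 1.0)) for m, s in zip(mean, std)]


def deterministic_action(mean):
    """Execution-time action: squash the mean, with no sampling noise."""
    return [math.tanh(m) for m in mean]


# Hypothetical policy output for 4 normalized motor commands.
mean = [0.2, -0.1, 0.0, 0.3]
std = [0.5, 0.5, 0.5, 0.5]
rng = random.Random(0)

explore_a = sample_action(mean, std, rng)   # varies from call to call
explore_b = sample_action(mean, std, rng)
execute_a = deterministic_action(mean)      # identical on every call
execute_b = deterministic_action(mean)

print(explore_a, execute_a)
```

In this sketch the randomness exists only during training. At execution time the same state always maps to the same command, which is what makes the comparison with a deterministic TD3 policy like-for-like.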

Paper Structure

This paper contains 11 sections, 10 equations, 18 figures, 1 table.

Figures (18)

  • Figure 1: Block diagram of low-level RPM controller
  • Figure 2: Average mean reward of small environment training for deterministic agent
  • Figure 3: Average mean reward of small environment training for stochastic agent
  • Figure 4: Entropy coefficient ($\alpha$) values of small environment training for stochastic agent
  • Figure 5: Response of deterministic and stochastic agents in small environment from an initial position of [-0.5, 0.5, 1.5]
  • ...and 13 more figures