Random Network Distillation Based Deep Reinforcement Learning for AGV Path Planning

Huilin Yin; Shengkai Su; Yinjia Lin; Pengju Zhen; Karin Festl; Daniel Watzenig

TL;DR

This work addresses AGV path planning under sparse rewards by adding a Random Network Distillation (RND) module that supplies intrinsic motivation and integrating it with Proximal Policy Optimization (PPO) in continuous-action, physics-based warehouse simulations. The proposed RND-PPO framework alternates between training the RND model, in which a predictor network $\hat f$ learns to match a fixed, randomly initialized target network $f$, and training the PPO policy. The prediction error $r_i = \|\hat f(s) - f(s)\|^2$ captures the novelty of state $s$ and serves as the intrinsic reward; it is added to the extrinsic reward $r_e$ to form $r = r_e + r_i$, which guides exploration. Empirical results in simple and complex scenes, both static and dynamic, show faster and more stable learning with RND-PPO than with PPO alone, achieving higher cumulative rewards in fewer episodes. The approach has practical value for scalable, reliable AGV deployment in dynamic warehouse environments and can be extended to other RL algorithms and more complex settings.
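The reward composition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `RNDSketch`, the single-layer linear networks, the feature dimension, and the learning rate are all assumptions chosen to keep the example short, and the PPO update that would consume the combined reward is omitted.

```python
import numpy as np


class RNDSketch:
    """Toy RND module with linear networks (hypothetical, for illustration only)."""

    def __init__(self, state_dim, feat_dim=8, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        # Target network f: randomly initialised and never trained.
        self.w_target = rng.normal(size=(state_dim, feat_dim))
        # Predictor network f_hat: trained to imitate f on visited states.
        self.w_pred = rng.normal(size=(state_dim, feat_dim))
        self.lr = lr

    def _error(self, state):
        # f_hat(s) - f(s) for a single state vector.
        return state @ self.w_pred - state @ self.w_target

    def intrinsic_reward(self, state):
        # r_i = || f_hat(s) - f(s) ||^2
        return float(np.sum(self._error(state) ** 2))

    def update(self, state):
        # One gradient-descent step on the squared prediction error.
        grad = 2.0 * np.outer(state, self._error(state))
        self.w_pred -= self.lr * grad


def total_reward(r_e, r_i):
    # r = r_e + r_i, as stated in the summary above.
    return r_e + r_i


if __name__ == "__main__":
    rnd = RNDSketch(state_dim=4)
    visited = np.array([0.5, -0.2, 0.1, 0.3])
    print("intrinsic reward before training:", rnd.intrinsic_reward(visited))
    for _ in range(50):
        rnd.update(visited)
    print("intrinsic reward after training:", rnd.intrinsic_reward(visited))
    print("combined reward:", total_reward(0.0, rnd.intrinsic_reward(visited)))
```

Repeatedly updating the predictor on the same state shrinks its intrinsic reward, so states the agent has not yet visited keep a comparatively large bonus. That is the mechanism by which the prediction error acts as a novelty signal when extrinsic rewards are sparse.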

Abstract

With the rapid development of intelligent warehousing systems, Automated Guided Vehicle (AGV) technology has grown quickly. In intelligent warehousing, an AGV must plan an optimal path safely and rapidly in complex, dynamic environments. Much existing research applies deep reinforcement learning to this problem. However, in environments with sparse extrinsic rewards, these algorithms often converge slowly, learn inefficiently, or fail to reach the target. Random Network Distillation (RND), an exploration-enhancement technique, can effectively improve the performance of proximal policy optimization by giving the AGV agent additional intrinsic rewards in sparse-reward environments. Moreover, most current research still uses 2D grid mazes as experimental environments, which have insufficient complexity and limited action sets. To address this limitation, we present simulation environments for AGV path planning with continuous actions and positions, bringing them closer to realistic physical scenarios. Our experiments and a comprehensive analysis of the proposed method demonstrate that it enables the AGV to complete path planning tasks with continuous actions more rapidly in our environments. A video of part of our experiments can be found at https://youtu.be/lwrY9YesGmw.

Paper Structure (11 sections, 6 equations, 8 figures, 1 algorithm)

Introduction
AGV Path Planning Environment Model
AGV Path Planning Based on RND-PPO
Framework of AGV Path Planning with RND-PPO
Random Network Distillation Model
AGV Agent Path Planning with RND-PPO
Experiments
Experimental Settings
Simple Scene Experiments
Complex Scene Experiments
Conclusions

Figures (8)

Figure 1: Top view schematic of AGV path planning simulation scenario.
Figure 2: Structure of the proposed RND-PPO.
Figure 3: Behavior of reward and episode length in the simple static scenario. (a) environment cumulative reward of the AGV agent and (b) episode length of the AGV agent.
Figure 4: Behavior of reward and episode length in the simple dynamic scenario. (a) environment cumulative reward of the AGV agent and (b) episode length of the AGV agent.
Figure 5: Complex static environment path planning trajectories. From left to right, the training episodes are $0.25$, $0.5$, $0.75$ and $1.0\cdot 10^6$. (a) corresponds to PPO and (b) corresponds to RND-PPO.
...and 3 more figures
