RadDQN: a Deep Q Learning-based Architecture for Finding Time-efficient Minimum Radiation Exposure Pathway

Biswajit Sadhu; Trijit Sadhu; S. Anand

TL;DR

RadDQN tackles the problem of finding time-efficient, minimum-radiation-exposure paths in radiological zones by introducing a radiation-aware reward function and novel exploration strategies within a DQN framework. The reward combines inverse-square exposure from multiple sources with proximity to the exit, formalized as $ r = \frac{n}{R_e} - \sum_i{\frac{\Gamma S_i}{R_{s,i}^{2}}} $ with $\Gamma = 1$ and $n = 1$, and is optimized in a 2D grid-world with 8 actions per step. Ground-truth paths are established via Dijkstra on a graph built from the grid using the same radiation model, and trajectory similarity is quantified with the Fréchet distance $\delta_F$, showing RadDQN achieves faster convergence and greater training stability than vanilla DQN across scenarios with two and three sources and varying strengths. A limitation is the static radiation field during training, with future work including dynamic-source scenarios and real-field robot experiments.
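The reward formula above can be evaluated directly for a single grid cell. The following is a minimal sketch, not the authors' implementation: the function name `radiation_reward`, the exit position `(9, 9)`, and the `eps` floor that avoids division by zero at a source or exit cell are all assumptions made for illustration. The source positions and unit strengths are taken from the Figure 2 caption.

```python
import math


def radiation_reward(pos, exit_pos, sources, gamma=1.0, n=1.0, eps=1e-6):
    """Evaluate r = n / R_e - sum_i(gamma * S_i / R_{s,i}^2) at one cell.

    pos, exit_pos : (x, y) grid coordinates
    sources       : list of ((x, y), strength) pairs
    eps           : assumed floor on distances (not from the paper) so the
                    expression stays finite on a source or exit cell
    """
    r_e = max(math.dist(pos, exit_pos), eps)  # distance to the exit
    exposure = sum(
        gamma * strength / max(math.dist(pos, src), eps) ** 2
        for src, strength in sources
    )
    return n / r_e - exposure


# Two unit-strength sources as in the Figure 2 caption.
SOURCES = [((2, 0), 1.0), ((7, 7), 1.0)]
EXIT = (9, 9)  # hypothetical exit cell, chosen only for this example

if __name__ == "__main__":
    for cell in [(5, 5), (7, 6), (9, 8)]:
        print(cell, radiation_reward(cell, EXIT, SOURCES))
```

With $\Gamma = 1$ and $n = 1$, cells next to a source receive a strongly negative reward, while cells near the exit and far from every source receive a positive one, which is the trade-off the agent has to learn.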

Abstract

Recent advances in deep reinforcement learning (DRL) have led to a wide range of applications in the automation sector. The ability of DRL to manage complex decision-making problems encourages its use in the nuclear industry for tasks such as optimizing radiation exposure to personnel during normal operating conditions and potential accident scenarios. However, the lack of an efficient reward function and an effective exploration strategy has hindered its use in developing radiation-aware autonomous unmanned aerial vehicles (UAVs) that achieve maximum radiation protection. In this article, we address these issues and introduce a deep Q-learning based architecture (RadDQN) that operates on a radiation-aware reward function to provide a time-efficient, minimum-radiation-exposure pathway in a radiation zone. We propose a set of unique exploration strategies that tune the balance between exploration and exploitation based on the state-wise variation in radiation exposure during training. Further, we benchmark the predicted path against a grid-based deterministic method. We demonstrate that the formulated reward function, in conjunction with an adequate exploration strategy, is effective in handling several scenarios with drastically different radiation field distributions. Compared to vanilla DQN, our model achieves a superior convergence rate and higher training stability.
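The grid-based deterministic benchmark and the path comparison can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: the edge cost (inverse-square exposure at the destination cell plus an optional `step_cost`), the 10 x 10 grid, the start and exit cells, and the `eps` floor are all hypothetical, and the discrete Fréchet distance is used here as a simple stand-in for $\delta_F$.

```python
import heapq
import math

# The 8 moves available per step in the grid-world.
MOVES = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def exposure(cell, sources, gamma=1.0, eps=1e-6):
    """Inverse-square exposure at a cell; eps is an assumed distance floor."""
    return sum(gamma * s / max(math.dist(cell, src), eps) ** 2 for src, s in sources)


def dijkstra_path(size, start, goal, sources, step_cost=0.0):
    """Least-cost path on a size x size grid with 8-connected moves.

    Assumed edge cost: exposure at the destination cell + step_cost.
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist.get(cell, math.inf):
            continue  # stale heap entry
        for dx, dy in MOVES:
            nxt = (cell[0] + dx, cell[1] + dy)
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue
            nd = d + exposure(nxt, sources) + step_cost
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                prev[nxt] = cell
                heapq.heappush(heap, (nd, nxt))
    # Walk the predecessor links back from the goal.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two point sequences (dynamic programming)."""
    ca = [[0.0] * len(q) for _ in p]
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            d = math.dist(a, b)
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[-1][-1]


SOURCES = [((2, 0), 1.0), ((7, 7), 1.0)]  # positions from the Figure 2 caption

if __name__ == "__main__":
    truth, cost = dijkstra_path(10, (0, 9), (9, 9), SOURCES)  # hypothetical S and E
    straight = [(x, 9) for x in range(10)]  # a naive candidate path to compare
    print(truth, cost, discrete_frechet(truth, straight))
```

A positive `step_cost` penalizes long detours, which is one way to express the time-efficiency side of the objective; with `step_cost=0.0` the search minimizes accumulated exposure alone. A distance of zero from `discrete_frechet` means the candidate path visits the same cells in the same order as the reference.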

Paper Structure (20 sections, 3 equations, 7 figures, 2 tables, 2 algorithms)

Introduction
Methodology
Reinforcement Learning and Vanilla DQN
Our Approach: RadDQN
Simulation Environment
Radiation-aware Reward Function
Defining Optimum Path
Neural Network (NN) Configuration
Exploration vs. Exploitation
Update frequency of Target Network
Finding ground-truth
Comparison of ground-truth with RadDQN-predicted paths
Results and discussion
Scenario with two radioactive sources
Scenario with three radioactive sources
...and 5 more sections

Figures (7)

Figure 1: Flowchart of RadDQN algorithm (present study).
Figure 2: Radiation-aware reward function for two sources of unit radiation strength at (2,0) and (7,7). The reward $r$ is obtained by subtracting $\sum_i{\frac{\Gamma S_i}{R_{s,i}^{2}}}$ from $\frac{n}{R_{e}}$, following the reward equation. The start and exit cells are marked 'S' and 'E', respectively. $\Gamma$ and $n$ are both set to 1.
Figure 3: Results for two different environments (Case I and Case II) with two sources. The start and exit points on the simulated floor are marked 'S' and 'E'. Top panel (a) shows the distribution of radiation exposure on the simulated floor. The colorbar indicates the intensity of the radiation. The white dashed line in the right figure (Case II) indicates the path connecting the states with the least radiation exposure (but with a larger number of steps). The optimal path (ground truth) obtained from Dijkstra's algorithm is shown as a black line, where the black dots along the path indicate the number of steps. Middle panel (b) shows the density of visited states using the RadDQN architecture. The associated contour plot shows that the learned path closely matches the ground truth. Bottom panel (c) shows the convergence of training in terms of maximizing the average reward per step. The dashed line indicates the ground-truth value.
Figure 4: Case studies with three sources at strategic positions. The start and exit points on the simulated floor are marked 'S' and 'E'. Left panels (a, b and c) show the distribution of radiation exposure on the simulated floor. The colorbar indicates the intensity of the radiation. The optimal path (ground truth) obtained from Dijkstra's algorithm is shown as a black line, where the black dots along the path indicate the number of steps. Right panels (a, b and c) show the density of visited states using the RadDQN architecture. The associated contour plot shows that the learned path closely matches the ground truth.
Figure 5: (a) Average cumulative reward vs. number of episodes played, showing the convergence of training in terms of maximizing the average reward per step; the dashed line indicates the ground-truth value (top panel). (b) Percentage of wins vs. episodes (bottom left panel). (c) Number of steps in the winning episodes vs. index of winning episodes (bottom right panel).
...and 2 more figures
