Vision-driven River Following of UAV via Safe Reinforcement Learning using Semantic Dynamics Model

Zihan Wang, Nina Mahmoudian

TL;DR

This work tackles vision-driven UAV river following in GPS-denied environments by modeling the task as a Constrained Submodular Markov Decision Process with non-Markovian, history-dependent rewards and safety costs. It introduces Marginal Gain Advantage Estimation (MGAE) for trajectory-aware reward optimization, a Semantic Dynamics Model (SDM) that uses patchified water masks with homography-based predictions for interpretable short-term dynamics, and the Constrained Actor Dynamics Estimator (CADE) that fuses MGAE, SDM, and a cost estimator within a model-based SafeRL framework. The approach demonstrates faster learning and improved performance via MGAE, enhanced safety-relevant prediction with SDM, and robust safety integration through the Lagrangian-based CADE and an optional hard safety layer. Together, these components enable safer, data-efficient navigation for UAV river following with potential real-world applicability in rescue, surveillance, and environmental monitoring under challenging, GPS-denied conditions.

Abstract

Vision-driven autonomous river following by Unmanned Aerial Vehicles (UAVs) is critical for applications such as rescue, surveillance, and environmental monitoring, particularly in dense riverine environments where GPS signals are unreliable. These safety-critical navigation tasks must satisfy hard safety constraints while optimizing performance. Moreover, the reward in river following is inherently history-dependent (non-Markovian) because it depends on which river segments have already been visited, which makes the task challenging for standard safe Reinforcement Learning (SafeRL). To address these gaps, we make three contributions. First, we introduce Marginal Gain Advantage Estimation (MGAE), which refines the reward advantage function by using a sliding-window baseline computed from historical episodic returns, aligning the advantage estimate with the non-Markovian reward. Second, we develop a Semantic Dynamics Model (SDM) based on patchified water semantic masks, which offers more interpretable and data-efficient short-term prediction of future observations than latent vision dynamics models. Third, we present the Constrained Actor Dynamics Estimator (CADE) architecture, which integrates the actor, cost estimator, and SDM for cost advantage estimation to form a model-based SafeRL framework. Simulation results demonstrate that MGAE achieves faster convergence and superior performance over traditional critic-based methods such as Generalized Advantage Estimation (GAE). SDM provides more accurate short-term state predictions, which enable the cost estimator to better predict potential violations. Overall, CADE effectively integrates safety regulation into model-based RL: the Lagrangian approach provides a "soft" balance between reward and safety during training, while the safety layer enhances inference by imposing a "hard" action overlay.
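The "soft" Lagrangian balance mentioned in the abstract can be illustrated with a generic dual-ascent update. This is a minimal sketch of the standard Lagrangian SafeRL recipe, not the paper's implementation: the function names, the learning rate `lr`, and the cost limit `cost_limit` are all hypothetical, and the abstract does not specify how CADE combines its reward and cost advantages.

```python
def update_lagrange_multiplier(lam, episode_cost, cost_limit, lr=0.05):
    """One projected dual-ascent step on the multiplier (illustrative only).

    The multiplier grows when the observed cost exceeds the limit and
    shrinks otherwise; it is clipped at zero so it stays a valid penalty.
    """
    return max(0.0, lam + lr * (episode_cost - cost_limit))


def penalized_advantage(reward_adv, cost_adv, lam):
    """Combine reward and cost advantages into a single actor signal.

    Assumption: a plain weighted difference, the textbook Lagrangian form.
    """
    return reward_adv - lam * cost_adv


if __name__ == "__main__":
    lam = 0.0
    # Hypothetical per-episode costs against a hypothetical limit of 1.0.
    for cost in [3.0, 2.0, 0.5]:
        lam = update_lagrange_multiplier(lam, cost, cost_limit=1.0, lr=0.5)
        print(lam, penalized_advantage(2.0, 1.0, lam))
```

A larger multiplier makes unsafe actions less attractive to the actor during training, whereas a hard safety layer would override actions directly at inference time.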

Paper Structure

This paper contains 22 sections, 22 equations, 16 figures, 2 tables, 1 algorithm.

Figures (16)

  • Figure 1: Comparison of Marginal Gain Advantage Estimation (MGAE) and Generalized Advantage Estimation (GAE). MGAE uses backward-looking estimation to calculate cumulative marginal gains, while GAE uses forward-looking value function estimation.
  • Figure 2: Observation images in Safe Riverine Environment. From left to right: RGB image, water semantic mask, patchified water semantic mask as RL observation.
  • Figure 3: Constrained Actor Dynamics Estimator (CADE) architecture for solving the CSMDP. The reason for choosing a recurrent or feedforward network for each module is noted next to the module block, and the role each module plays is noted inside the module block.
  • Figure 4: CADE computational graph. a is the current action to be executed, r is the estimated current reward, c is the estimated current cost, and o is the predicted next observation. MLP is Multi-Layer Perceptron; GRU is Gated Recurrent Unit. Black arrows denote the forward pass. Purple arrows denote the backpropagation pass. Best viewed in color.
  • Figure 5: Overview of two Constrained Submodular Markov Decision Process (CSMDP) environments, with increasing difficulty level {easy, medium, hard} from left to right.
  • ...and 11 more figures
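
The patchified water semantic mask described for Figure 2 can be pictured as a coarse grid derived from the pixel-level mask. The sketch below is one plausible reading, assuming each patch stores the fraction of water pixels it covers; the function name `patchify_mask`, the patch size, and the averaging rule are hypothetical, since the captions do not state how the patches are computed.

```python
def patchify_mask(mask, patch):
    """Reduce a binary water mask to a grid of per-patch water fractions.

    mask  : list of equal-length rows containing 0 (non-water) or 1 (water)
    patch : side length of each square patch, in pixels (assumed value)
    """
    rows, cols = len(mask), len(mask[0])
    if rows % patch or cols % patch:
        raise ValueError("mask dimensions must be divisible by the patch size")

    grid = []
    for r in range(0, rows, patch):
        grid_row = []
        for c in range(0, cols, patch):
            # Count water pixels inside the current patch.
            water = sum(
                mask[i][j]
                for i in range(r, r + patch)
                for j in range(c, c + patch)
            )
            grid_row.append(water / (patch * patch))
        grid.append(grid_row)
    return grid


if __name__ == "__main__":
    toy_mask = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(patchify_mask(toy_mask, 2))
```

A grid like this is far smaller than the raw image, which is consistent with the stated goal of a more interpretable and data-efficient observation.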