Perturbation-mitigated USV Navigation with Distributionally Robust Reinforcement Learning

Zhaofan Zhang, Minghao Yang, Sihong Xie, Hui Xiong

TL;DR

This paper tackles robust USV navigation under heteroscedastic observational noise by introducing DRIQN, a framework that unifies Distributionally Robust Optimization with Implicit Quantile Networks and a gradient-substitution mechanism. It leverages a replay buffer partitioned into noise-pattern subgroups to address multiple environmental noise sources, formulating a tractable dual quadratic program over subgroup gradients. Extensive simulations show DRIQN surpasses state-of-the-art baselines in success rate, collision avoidance, and efficiency across varying noise conditions, with last-layer gradient substitution providing additional gains over full-network substitution. The work advances risk-sensitive RL for autonomous maritime navigation and lays groundwork for real-world deployment under complex perceptual disturbances.
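The partitioned replay buffer described above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `SubgroupReplayBuffer`, the per-group capacity, and the use of an arbitrary hashable `group_id` to label a noise pattern are all assumptions made for the example.

```python
import random
from collections import defaultdict, deque


class SubgroupReplayBuffer:
    """Hypothetical replay buffer that keeps one FIFO queue per noise-pattern subgroup."""

    def __init__(self, capacity_per_group=10_000, seed=None):
        # One bounded deque per subgroup; the oldest transitions are evicted first.
        self._groups = defaultdict(lambda: deque(maxlen=capacity_per_group))
        self._rng = random.Random(seed)

    def add(self, group_id, transition):
        """Store a transition under the subgroup (noise pattern) it was collected in."""
        self._groups[group_id].append(transition)

    def sample_per_group(self, batch_size):
        """Return {group_id: list of transitions}, sampling each subgroup separately
        so that a per-subgroup loss (and gradient) can be computed downstream."""
        batches = {}
        for group_id, buf in self._groups.items():
            k = min(batch_size, len(buf))
            batches[group_id] = self._rng.sample(list(buf), k)
        return batches

    def __len__(self):
        return sum(len(buf) for buf in self._groups.values())


# Usage: transitions from one episode share a group id (an assumed labeling scheme).
buffer = SubgroupReplayBuffer(capacity_per_group=100, seed=0)
for step in range(5):
    buffer.add("low_noise", ("obs", "action", 0.0, "next_obs", step))
for step in range(2):
    buffer.add("high_noise", ("obs", "action", -1.0, "next_obs", step))
batches = buffer.sample_per_group(batch_size=3)
```

Sampling each subgroup separately, rather than uniformly over the whole buffer, is what makes per-subgroup losses available to a worst-case objective; how the paper actually balances subgroup sizes is not stated in this summary.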

Abstract

The robustness of Unmanned Surface Vehicles (USVs) is crucial in unknown and complex marine environments, especially when heteroscedastic observational noise poses significant challenges to sensor-based navigation. Recently, Distributional Reinforcement Learning (DistRL) has shown promising results on challenging autonomous navigation tasks without prior environmental information. However, these methods overlook situations where noise patterns vary across environmental conditions, which hinders safe navigation and disrupts the learning of value functions. To address this problem, we propose DRIQN, which integrates Distributionally Robust Optimization (DRO) with implicit quantile networks to optimize worst-case performance under natural environmental conditions. Leveraging explicit subgroup modeling in the replay buffer, DRIQN incorporates heterogeneous noise sources and targets robustness-critical scenarios. Experimental results in a risk-sensitive environment demonstrate that DRIQN significantly outperforms state-of-the-art methods, achieving +13.51% success rate, -12.28% collision rate, +35.46% time saving, and +27.99% energy saving compared with the runner-up.

Paper Structure

This paper contains 14 sections, 1 theorem, 15 equations, 4 figures, 3 tables.

Key Result

Theorem 1

Let $\theta_m\in\mathbb{R}^{d}$ be the model parameter at iteration $m$, and define the subgroup losses $f_j(\theta)$, $j=1,\dots,J$. Denote $G_m := \nabla_\theta f(\theta_m)\in\mathbb{R}^{J\times d}$, whose $j$-th row is $\nabla_\theta f_j(\theta_m)^{\top}$, and consider the quadrat[…] where $\lambda^\star$ solves the dual quadratic program […] Hence $\delta_m^\star$ is a linear combina[…]
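Because the statement above is truncated in this summary, the exact program of Theorem 1 is not recoverable here. As an illustration of the general idea only, the sketch below assumes a standard min-max direction-finding problem, $\min_{\delta}\max_{j}\, \nabla_\theta f_j(\theta_m)^{\top}\delta + \tfrac{1}{2}\lVert\delta\rVert^2$, whose dual is the quadratic program $\min_{\lambda\in\Delta_J} \tfrac{1}{2}\lVert G_m^{\top}\lambda\rVert^2$ over the probability simplex, with $\delta^\star = -G_m^{\top}\lambda^\star$ (the same form as the min-norm problem used in multiple-gradient descent). The objective, the projected-gradient solver, and the function names are assumptions and may differ from the paper's formulation.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]                       # sort in descending order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]   # last index kept positive
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)


def solve_dual_qp(G, steps=500):
    """Assumed dual: minimize 0.5 * ||G^T lam||^2 over the simplex,
    solved here by projected gradient descent."""
    num_groups = G.shape[0]
    K = G @ G.T                                # Gram matrix of subgroup gradients
    lr = 1.0 / max(np.linalg.eigvalsh(K)[-1], 1e-12)   # 1 / Lipschitz constant
    lam = np.full(num_groups, 1.0 / num_groups)
    for _ in range(steps):
        lam = project_to_simplex(lam - lr * (K @ lam))
    return lam


def robust_direction(G):
    """Return (delta, lam): delta = -G^T lam is a linear combination of the
    rows of G, i.e. of the subgroup gradients."""
    lam = solve_dual_qp(G)
    return -G.T @ lam, lam


# Usage: two subgroups with orthogonal gradients receive equal weight.
G = np.array([[1.0, 0.0],
              [0.0, 1.0]])
delta, lam = robust_direction(G)
```

Under this assumed form, the dual has only $J$ variables (one per subgroup) regardless of the parameter dimension $d$, which is what makes it cheap to solve at every iteration.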

Figures (4)

  • Figure 1: Proposed DRIQN vs. IQN: visualization of policy trajectories under observational noise from the sensor. For IQN-based methods, we investigate the adaptive and greedy strategies, detailed in Section Metrics and Strategy.
  • Figure 2: Overall framework of our method. In (a), the USV receives noisy observations specific to its current natural condition; the noise pattern is assumed to be consistent within one episode because the environmental conditions encountered are close to one another. Consequently, the replay buffer in (b) accumulates subgroups with distinct noise patterns. This variation has a significant impact on the training of the implicit quantile networks in (c). To enhance robustness, DRIQN employs "Gradient Substitution" in (d), replacing the gradients of the output layer with those computed via DRO.
  • Figure 3: Online evaluations compare three learning-based models and two distributional RL strategies. Performance, safety, and convergence are assessed via three key metrics with cumulative reward curves. Implementation details appear in Experimental Settings. Due to space limits, the other two figures, for noise levels beyond 0.6, are in Appendix B.
  • Figure 4: Qualitative trajectory results (noise variance level = 0.6). The greedy approach avoids risk-sensitive adjustments triggered by noisy observations, unlike the adaptive strategy. Consistent with Table \ref{tab:standard_metrics}, the adaptive strategy exhibits higher timeout rates and frequently deviates from complex conditions to mitigate risk, as seen in repeated trials.
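The "Gradient Substitution" step in the Figure 2 caption, where only the output-layer gradient is replaced, might look roughly like the sketch below. Several simplifications are assumed: a linear output layer, a squared-error loss standing in for the quantile loss, hypothetical layer names (`"hidden"`, `"output"`), and subgroup weights `lam` that are simply passed in rather than computed by the DRO step.

```python
import numpy as np


def last_layer_group_grads(H, y, W, groups, num_groups):
    """Per-subgroup gradient of 0.5 * mean squared error w.r.t. output weights W.

    H: (N, k) penultimate-layer features, y: (N,) targets,
    W: (k,) output weights, groups: (N,) integer subgroup ids.
    """
    grads = np.zeros((num_groups, W.size))
    for j in range(num_groups):
        mask = groups == j
        if mask.any():
            resid = H[mask] @ W - y[mask]
            grads[j] = H[mask].T @ resid / mask.sum()
    return grads


def substitute_last_layer_grad(grads_by_layer, G_last, lam):
    """Copy the layer -> gradient mapping, replacing only the output-layer entry
    with the lam-weighted combination of subgroup gradients."""
    out = dict(grads_by_layer)
    out["output"] = lam @ G_last
    return out


# Usage: two samples, one per subgroup; the second subgroup has the larger error.
H = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y = np.array([0.0, 0.0])
W = np.array([1.0, 2.0])
groups = np.array([0, 1])

G_last = last_layer_group_grads(H, y, W, groups, num_groups=2)
avg_grads = {"hidden": np.zeros(3), "output": G_last.mean(axis=0)}
lam = np.array([0.0, 1.0])   # assumed weights that put all mass on subgroup 1
new_grads = substitute_last_layer_grad(avg_grads, G_last, lam)
```

Restricting the substitution to the output layer keeps the per-subgroup gradient matrix small, since its width is the number of output-layer parameters rather than the size of the full network.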

Theorems & Definitions (1)

  • Theorem 1: Equivalent dual quadratic program