Monocular Obstacle Avoidance Based on Inverse PPO for Fixed-wing UAVs

Haochen Chai, Meimei Su, Yang Lyu, Zhunga Liu, Chunhui Zhao, Quan Pan

TL;DR

This paper proposes a lightweight deep reinforcement learning (DRL) based collision avoidance system that enables a fixed-wing UAV to avoid unknown obstacles at cruise speeds above 30 m/s using only onboard visual sensors.

Abstract

Fixed-wing Unmanned Aerial Vehicles (UAVs) are among the most commonly used platforms for the burgeoning Low-altitude Economy (LAE) and Urban Air Mobility (UAM), owing to their long endurance and high-speed capabilities. Classical obstacle avoidance systems, which rely on prior maps or sophisticated sensors, face limitations in unknown low-altitude environments and on small UAV platforms. In response, this paper proposes a lightweight deep reinforcement learning (DRL) based UAV collision avoidance system that enables a fixed-wing UAV to avoid unknown obstacles at cruise speeds above 30 m/s using only onboard visual sensors. The proposed system employs a single-frame image depth inference module with a streamlined network architecture, optimized for edge computing devices, to ensure real-time obstacle detection. Next, a reinforcement learning controller with a novel reward function is designed to balance target approach against flight trajectory smoothness while satisfying the dynamic constraints and stability requirements specific to a fixed-wing UAV platform. An adaptive entropy adjustment mechanism is introduced to manage the exploration-exploitation trade-off inherent in DRL, improving training convergence and obstacle avoidance success rates. Extensive software-in-the-loop and hardware-in-the-loop experiments demonstrate that the proposed framework outperforms other methods in obstacle avoidance efficiency and flight trajectory smoothness, and confirm the feasibility of running the algorithm on edge devices. The source code is publicly available at https://github.com/ch9397/FixedWing-MonoPPO.
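
The abstract does not spell out the adaptive entropy adjustment rule, so the following is only a minimal sketch of one common way to adapt an entropy coefficient in an entropy-regularized policy objective. It uses a SAC-style temperature update as a stand-in, not the paper's own mechanism. All names here (`gaussian_entropy`, `update_entropy_coef`, `regularized_objective`, `target_entropy`, `lr`) are hypothetical and chosen for illustration.

```python
import math


def gaussian_entropy(stds):
    """Differential entropy of a diagonal Gaussian policy with the given std devs."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in stds)


def update_entropy_coef(log_coef, entropy, target_entropy, lr=0.05):
    """One gradient-style step on the log-coefficient (assumed rule, SAC-like).

    The coefficient grows when the policy entropy is below the target
    (encouraging exploration) and shrinks when it is above the target.
    """
    return log_coef - lr * (entropy - target_entropy)


def regularized_objective(surrogate, entropy, log_coef):
    """Policy surrogate plus an entropy bonus weighted by exp(log_coef)."""
    return surrogate + math.exp(log_coef) * entropy


if __name__ == "__main__":
    log_coef = 0.0
    # Hypothetical per-dimension action std devs shrinking over training.
    for stds in ([1.0, 1.0], [0.5, 0.5], [0.2, 0.2]):
        h = gaussian_entropy(stds)
        log_coef = update_entropy_coef(log_coef, h, target_entropy=1.0)
        print(f"entropy={h:.3f} coef={math.exp(log_coef):.3f}")
```

Working in log space keeps the coefficient positive without clipping. The target entropy and learning rate are tuning assumptions, not values taken from the paper.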

Paper Structure

This paper contains 26 sections, 2 theorems, 30 equations, 14 figures, 2 tables.

Key Result

Lemma 4.1

$H(\pi(s,a))$ is $\eta$-smooth; by Taylor's theorem, we have [equation missing from the source], where $\eta$ is a coefficient.
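
The displayed inequalities of the lemma are not reproduced in this summary. For orientation only, the standard meaning of $\eta$-smoothness for a differentiable function $f$, and the bound that Taylor's theorem gives under it, are stated below. Applying this to $H(\pi(s,a))$ is an assumption about what the lemma intends, not a reconstruction of its exact statement.

$$
\|\nabla f(x) - \nabla f(y)\| \le \eta\,\|x - y\| \quad \text{for all } x, y,
$$

$$
\bigl| f(y) - f(x) - \langle \nabla f(x),\, y - x \rangle \bigr| \le \frac{\eta}{2}\,\|y - x\|^2 .
$$

The second inequality follows from the first by integrating the gradient along the segment from $x$ to $y$.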

Figures (14)

  • Figure 1: Simulation scenarios and fixed-wing UAV model used for training and validation. Full video link: https://youtu.be/DXP54UI2lbE
  • Figure 2: The proposed obstacle avoidance framework for fixed-wing UAVs. A depth map is generated from a monocular RGB image using the method described in bhat2023zoedepth, which is encoded by a lightweight backbone ma2024rewrite to extract visual features. These visual features are concatenated with target features and input into the policy network to generate actions, while the critic network evaluates state values. An adaptive entropy module dynamically adjusts the exploration-exploitation tradeoff during training, and an inverse reward function updates the replay buffer, facilitating continuous policy optimization.
  • Figure 3: Training flight paths. The yellow six-pointed stars represent the targets, the red star indicates the fixed-wing UAV's take-off position, and the purple line represents the expected flight trajectory.
  • Figure 4: The comparison of the impact of different reward functions on obstacle avoidance flight trajectories. The red solid lines represent the flight trajectories of the fixed-wing UAV generated by the decision-making process of deep reinforcement learning (DRL) algorithms. The blue solid lines with arrows represent the expected flight trajectories, which point from the take-off points toward the target points. The green dashed lines represent the inferred depth map during obstacle avoidance maneuvers. (a), (c), and (e) show the obstacle avoidance trajectories generated by the model that only uses $r_{\rm{dis}}$, while (b), (d), and (f) show the obstacle avoidance trajectories produced by the model trained using the proposed reward function.
  • Figure 5: Training cumulative rewards comparison. The solid lines represent the average rewards of our algorithms and baselines per episode, while the shaded areas indicate the variability in the reward accumulation for each method.
  • ...and 9 more figures

Theorems & Definitions (5)

  • Definition 4.1
  • Remark 4.1
  • Lemma 4.1
  • Theorem 4.1
  • Proof 4.1