Dynamic Open Vocabulary Enhanced Safe-landing with Intelligence (DOVESEI)

Haechan Mark Bong, Rongge Zhang, Ricardo de Azambuja, Giovanni Beltrame

TL;DR

This work tackles safe landing for urban UAVs using a monocular RGB camera and open vocabulary segmentation. It introduces DOVESEI, a ROS 2-based system with a Landing Heatmap Generation Service powered by CLIPSeg and a Main Processing Node that applies dynamic focus masking to stabilize landing decisions. Starting from $100\,\text{m}$ altitude, the dynamic focus raises landing success to 29/50 trials, compared to 3/50 without it, and reduces travel time and distance on successful trials. The approach minimizes the need for extensive data collection or external guidance, offering a practical path toward robust, onboard safe landings in diverse urban environments. Open-source code is available at github.com/MISTLab/DOVESEI.

Abstract

This work targets what we consider to be the foundational step for urban airborne robots: a safe landing. Our attention is directed toward what we deem the most crucial aspect of the safe landing perception stack: segmentation. We present a streamlined reactive UAV system that employs visual servoing by harnessing the capabilities of open vocabulary image segmentation. Thanks to its open vocabulary methodology, this approach can adapt to various scenarios with minimal adjustments, bypassing the need for extensive data collection to refine internal models. Given the limitations imposed by local authorities, our primary focus is on operations starting from altitudes of 100 meters. This choice is deliberate, as numerous preceding works have dealt with altitudes up to 30 meters, aligning with the capabilities of small stereo cameras. Consequently, we leave the remaining 20 meters to be navigated using conventional 3D path planning methods. Using monocular cameras and image segmentation, our findings demonstrate the system's capability to successfully execute landing maneuvers down to altitudes as low as 20 meters. However, this approach is vulnerable to intermittent and occasionally abrupt fluctuations in the segmentation between frames of a video stream. To address this challenge, we enhance the image segmentation output by introducing what we call a dynamic focus: a masking mechanism that self-adjusts according to the current landing stage. This dynamic focus guides the control system to avoid regions beyond the drone's safety radius projected onto the ground, thus mitigating the problems with fluctuations. With this supplementary layer, our experiments show an almost tenfold improvement in landing success rate compared to global segmentation. All the source code is open source and available online (github.com/MISTLab/DOVESEI).
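
As a quick sanity check of the "almost tenfold" figure, the trial counts quoted in the TL;DR (29/50 successes with dynamic focus, 3/50 without) give:

$$
\frac{29/50}{3/50} = \frac{29}{3} \approx 9.7
$$

that is, a success rate of 58% with the dynamic focus versus 6% with global segmentation.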

Paper Structure (12 sections, 2 equations, 6 figures, 1 table)

This paper contains 12 sections, 2 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Our safe-landing system was implemented in ROS 2 and is composed of three main blocks: UAV (flight controller, sensors), landing heatmap generation (receives an RGB image and produces a heatmap of the best places to land), and main processing node (orchestrates the data exchange with the UAV, sends velocity or landing commands).
  • Figure 2: The visual servoing system takes as input raw segmentation heatmaps (pixels with classes considered good to land on), averages them over time (UAV's max. bank angle is limited, constraining its max. horizontal speed), creates a distance map, applies our dynamic focus masking method, and finally applies the objective function, Eq. \ref{eqn:bestpixel}, to decide on the best pixel.
  • Figure 3: The focus mask radius (R in the illustration above) varies continuously (Eq. \ref{eqn:dynamic_focus}), expanding or shrinking according to the current state of the system. Its minimum size is limited by the UAV's projection on the ground (multiplier factor 6X for Aiming and 2X for Landing), while its upper limit is reached when the image is inscribed in the circle.
  • Figure 4: Satellite image of Paris showing the latitude and longitude bounding box used to uniformly sample the 50 starting positions for our experiments (dashed red box) [GoogleMaps2023].
  • Figure 5: Example of a successful landing approach (from lat. / lon. 48.83948619062335 / 2.296169442158885 to 48.83922328688019 / 2.2948090593103774, red circle to blue star). From right to left: the UAV's initial view (alt. 100m), a zoomed-out view at alt. 300m (trajectory in red, initial view in yellow dashes), the final location (alt. 100m), and the UAV's final view.
  • ...and 1 more figures
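
The processing steps named in the captions of Figures 2 and 3 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the DOVESEI implementation: all function names are hypothetical, the temporal averaging is assumed to be an exponential moving average, the 6X/2X multipliers are assumed to apply to the projected UAV radius in pixels, the distance map step is omitted, and a plain "highest score" rule stands in for the paper's objective function, which is not reproduced in this section.

```python
import math

# Multipliers quoted in the Figure 3 caption. Applying them to the projected
# UAV *radius* (rather than, say, its diameter) is an assumption.
STATE_MULTIPLIER = {"aiming": 6.0, "landing": 2.0}


def radius_bounds(height, width, uav_radius_px, state):
    """Return (min_radius, max_radius) in pixels for the focus mask."""
    # Upper limit: the circle in which the whole image is inscribed.
    r_max = 0.5 * math.hypot(height, width)
    # Lower limit: the UAV's ground projection scaled by the state multiplier.
    r_min = min(STATE_MULTIPLIER[state] * uav_radius_px, r_max)
    return r_min, r_max


def clamp_radius(radius, bounds):
    """Keep a proposed focus radius inside its allowed range."""
    r_min, r_max = bounds
    return max(r_min, min(radius, r_max))


def temporal_average(previous, current, alpha=0.2):
    """Blend two heatmaps with an exponential moving average (assumed filter)."""
    return [
        [(1.0 - alpha) * p + alpha * c for p, c in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]


def apply_focus_mask(heatmap, radius):
    """Zero every pixel farther than `radius` from the image centre."""
    height, width = len(heatmap), len(heatmap[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    return [
        [v if math.hypot(y - cy, x - cx) <= radius else 0.0
         for x, v in enumerate(row)]
        for y, row in enumerate(heatmap)
    ]


def best_pixel(heatmap):
    """Stand-in objective: (row, col) of the highest score, first one on ties."""
    best = max(
        ((v, y, x) for y, row in enumerate(heatmap) for x, v in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1], best[2]
```

In this sketch, a high-scoring pixel near the image border is ignored once the focus radius shrinks below its distance from the centre, so the selected target stays close to the area directly beneath the UAV even if the segmentation flickers elsewhere in the frame.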