BlabberSeg: Real-Time Embedded Open-Vocabulary Aerial Segmentation

Haechan Mark Bong, Ricardo de Azambuja, Giovanni Beltrame

TL;DR

BlabberSeg is an optimized Vision-Language Model built on CLIPSeg for on-board, real-time processing of aerial images by UAVs. It improves the efficiency of CLIPSeg by reusing prompt and model features, which reduces computational overhead while achieving real-time open-vocabulary aerial segmentation.

Abstract

Real-time aerial image segmentation plays an important role in the environmental perception of Uncrewed Aerial Vehicles (UAVs). We introduce BlabberSeg, an optimized Vision-Language Model built on CLIPSeg for on-board, real-time processing of aerial images by UAVs. BlabberSeg improves the efficiency of CLIPSeg by reusing prompt and model features, reducing computational overhead while achieving real-time open-vocabulary aerial segmentation. We validated BlabberSeg in a safe landing scenario using the Dynamic Open-Vocabulary Enhanced SafE-Landing with Intelligence (DOVESEI) framework, which uses visual servoing and open-vocabulary segmentation. BlabberSeg reduces computational costs significantly, with a speed increase of 927.41% (16.78 Hz) on an NVIDIA Jetson Orin AGX (64GB) compared with the original CLIPSeg (1.81 Hz), achieving real-time aerial segmentation with a negligible loss in accuracy (2.1%, measured as the ratio of the correctly segmented area with respect to CLIPSeg). BlabberSeg's source code is open and available online.
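To put the reported frequencies in per-frame terms, the short Python sketch below converts them into a throughput ratio and into latencies. It uses only the two rounded values quoted in the abstract; the function names are illustrative and not part of BlabberSeg. The rounded values give a ratio of about 9.27x, which is close to the reported 927.41% when expressed as a percentage. The small difference presumably comes from unrounded measurements, which is an assumption here.

```python
# Frequencies as quoted in the abstract (rounded values).
BASELINE_HZ = 1.81    # original CLIPSeg on the Jetson Orin AGX
OPTIMIZED_HZ = 16.78  # BlabberSeg on the same board


def speedup_ratio(baseline_hz: float, optimized_hz: float) -> float:
    """Return how many times more frames per second the optimized model handles."""
    return optimized_hz / baseline_hz


def latency_ms(frequency_hz: float) -> float:
    """Return the average time spent on one frame, in milliseconds."""
    return 1000.0 / frequency_hz


ratio = speedup_ratio(BASELINE_HZ, OPTIMIZED_HZ)
print(f"throughput ratio: {ratio:.2f}x")                          # 9.27x
print(f"baseline latency: {latency_ms(BASELINE_HZ):.1f} ms")      # 552.5 ms
print(f"optimized latency: {latency_ms(OPTIMIZED_HZ):.1f} ms")    # 59.6 ms
```

In other words, the time per frame drops from roughly half a second to about 60 ms on the rounded figures.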

Paper Structure

This paper contains 18 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: BlabberSeg Architecture: This architecture uses the OpenCLIP Visual Transformer's ability to cast CLIP to 16-bit floating point (FP16), together with an FP16-converted CLIPSeg Decoder that reuses prompt and positional embeddings, to accelerate segmentation. P is the number of prompts, i.e., the number of words selected to describe safe landing zones (e.g., Grass, Lawn, Flat, Park).
  • Figure 2: DOVESEI [bong2023dynamic] was implemented in ROS 2 and is composed of three main blocks: UAV (flight controller, sensors), landing heatmap generation (receives an RGB image and produces a heatmap of the best places to land), and main processing node (orchestrates the data exchange with the UAV and sends velocity commands).
  • Figure 3: Increase in Frequency Through Optimization. Model legend: Original = CLIPSeg (Original); FP = CLIPSeg (FP16); RPE = CLIPSeg + Reusing Prompt Embeddings; FP-RPE = CLIPSeg (FP16) + Reusing Prompt Embeddings; HF = CLIPSeg (Hugging Face); HF-RPE = CLIPSeg (Hugging Face) + Reusing Prompt Embeddings; FP-RPPE = CLIPSeg (FP16) + Reusing Prompt & Positional Embeddings; FP-RPPET = CLIPSeg (FP16) + Reusing Prompt & Positional Embeddings + TensorRT; FP-RPPETI = CLIPSeg (FP16) + Reusing Prompt & Positional Embeddings + TensorRT + Input/Output Binding.
  • Figure 4: Duration of Image Transformation before Segmentation.
  • Figure 5: GPU Usage During Optimization.
  • ...and 3 more figures
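
The Figure 1 caption mentions reusing prompt embeddings. The idea is that the P landing-zone prompts stay the same from frame to frame, so their text embeddings can be computed once and reused rather than re-encoded for every image. The Python sketch below illustrates only that caching pattern. It is not BlabberSeg's implementation: `toy_text_encoder` is a hypothetical stand-in for an expensive text encoder such as CLIP's, and `PromptEmbeddingCache` is an illustrative name.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Embedding = Tuple[float, ...]


def toy_text_encoder(prompt: str) -> Embedding:
    """Hypothetical stand-in for an expensive text encoder such as CLIP's."""
    return (float(len(prompt)), float(sum(ord(c) for c in prompt) % 97))


class PromptEmbeddingCache:
    """Encode each distinct prompt once and reuse the result on later frames."""

    def __init__(self, encode: Callable[[str], Embedding]) -> None:
        self._encode = encode
        self._cache: Dict[str, Embedding] = {}
        self.encoder_calls = 0  # counts how often the encoder actually runs

    def get(self, prompts: Sequence[str]) -> List[Embedding]:
        embeddings = []
        for prompt in prompts:
            if prompt not in self._cache:
                # Cache miss: pay the encoding cost once for this prompt.
                self._cache[prompt] = self._encode(prompt)
                self.encoder_calls += 1
            embeddings.append(self._cache[prompt])
        return embeddings


PROMPTS = ["grass", "lawn", "flat", "park"]  # P = 4, examples from Figure 1
cache = PromptEmbeddingCache(toy_text_encoder)

for _frame_index in range(5):  # five simulated camera frames
    prompt_embeddings = cache.get(PROMPTS)
    # A real pipeline would pass the frame and prompt_embeddings to the decoder here.

print(cache.encoder_calls)  # 4: one encoder call per prompt, not per frame
```

With five frames and four prompts, the encoder runs four times instead of twenty. The saving grows with the number of frames as long as the prompt list does not change.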