Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation

Lingfeng Zhang, Yuchen Zhang, Hongsheng Li, Haoxiang Fu, Yingbo Tang, Hangjun Ye, Long Chen, Xiaojun Liang, Xiaoshuai Hao, Wenbo Ding

TL;DR

This work tackles the gap in evaluating and enabling spatial intelligence for Vision-Language Models in UAV navigation. It introduces SpatialSky-Bench and SpatialSky-Dataset to benchmark and train UAV-focused spatial reasoning across 13 tasks spanning environmental perception and scene understanding, and presents Sky-VLM, a model trained in two stages (SFT, then GRPO-based RFT) that achieves state-of-the-art results on the benchmark. The approach uses multimodal UAV data, automated QA generation with structured outputs, and task-specific rewards to improve localization, counting, and planning capabilities. The results show substantial improvements over both open-source and closed-source baselines, underscoring the potential for spatially aware VLMs to enhance autonomous UAV decision-making in complex environments.

Abstract

Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments. To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation. Our benchmark comprises two categories, Environmental Perception and Scene Understanding, divided into 13 subcategories, including bounding boxes, color, distance, height, and landing safety analysis, among others. Extensive evaluations of various mainstream open-source and closed-source VLMs reveal unsatisfactory performance in complex UAV navigation scenarios, highlighting significant gaps in their spatial capabilities. To address this challenge, we developed the SpatialSky-Dataset, a comprehensive dataset containing 1M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts. Extensive experimental results demonstrate that Sky-VLM achieves state-of-the-art performance across all benchmark tasks, paving the way for the development of VLMs suitable for UAV scenarios. The source code is available at https://github.com/linglingxiansen/SpatialSKy.

Paper Structure

This paper contains 14 sections, 9 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of SpatialSky-Bench. Our benchmark is divided into two categories, Environmental Perception and Scene Understanding, covering a total of 13 subcategories. We evaluate VLMs’ spatial intelligence capabilities across these UAV navigation tasks.
  • Figure 2: Distribution of our dataset and benchmark.
  • Figure 3: SpatialSky-Dataset Generation Process. Our generation pipeline takes multimodal inputs, including RGB images, semantic labels, LiDAR depth data, UAV pose information, and bounding boxes. Using a VLM-based generation method and human expert validation, we automatically generate diverse question-answer pairs for 13 spatial reasoning tasks.
  • Figure 4: Overview of our Sky-VLM. Sky-VLM adopts a two-stage training approach. In the first stage, we perform supervised fine-tuning (SFT) on the entire SpatialSky-Dataset to develop basic spatial reasoning capabilities. In the second stage, we apply reinforcement fine-tuning (RFT) with task-specific reward functions to enhance decision-making accuracy for key spatial tasks.
  • Figure 5: Performance of Our Sky-VLM.
  • ...and 2 more figures
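
The Figure 4 caption describes a second training stage that uses task-specific reward functions, and the TL;DR identifies that stage as GRPO-based. As a rough illustration only, the sketch below shows what a reward for a bounding-box task and the group-relative advantage step of GRPO could look like. The choice of IoU as the reward, the `(x1, y1, x2, y2)` box format, the use of the population standard deviation, and the function names `iou` and `group_relative_advantages` are all assumptions made for this example; the summary above does not specify the paper's actual reward definitions.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2).

    Assumed reward for a bounding-box task; not taken from the paper.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    GRPO scores each response relative to the other responses sampled
    for the same prompt: (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: three sampled box predictions for one question.
ground_truth = (0, 0, 2, 2)
predictions = [(0, 0, 2, 2), (1, 1, 3, 3), (5, 5, 6, 6)]
rewards = [iou(p, ground_truth) for p in predictions]
advantages = group_relative_advantages(rewards)
```

In this sketch the exact match receives the largest advantage and the disjoint box the smallest, so the policy update would favor better-localized predictions within the group.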