Falcon: A Remote Sensing Vision-Language Foundation Model (Technical Report)

Kelu Yao, Nuo Xu, Rong Yang, Yingying Xu, Zhuoyan Gao, Titinunt Kitrungrotsakul, Yi Ren, Pu Zhang, Jin Wang, Ning Wei, Chao Li

TL;DR

Falcon introduces a lightweight 0.7B remote sensing vision-language foundation model with a unified image–region–pixel understanding framework. Trained on Falcon_SFT, a 78M-sample, 5.6M-image multi-task dataset, Falcon casts 14 tasks as sequence-to-sequence problems and uses dynamic prompt training to flexibly follow instructions. Across 67 RS datasets and 14 tasks, Falcon demonstrates strong zero-shot and in-dataset performance, surpassing prior RS-VLMs while remaining computationally efficient. The work provides extensive qualitative and quantitative analyses and releases the dataset, code, and weights to foster open, scalable RS-VLM research.

Abstract

This paper introduces Falcon, a holistic vision-language foundation model tailored for remote sensing. Falcon offers a unified, prompt-based paradigm that effectively executes comprehensive and complex remote sensing tasks, and it demonstrates powerful understanding and reasoning abilities at the image, region, and pixel levels. Specifically, given simple natural language instructions and remote sensing images, Falcon can produce impressive results in text form across 14 distinct tasks, e.g., image classification, object detection, segmentation, and image captioning. To facilitate Falcon's training and empower its representation capacity to encode rich spatial and semantic information, we developed Falcon_SFT, a large-scale, multi-task, instruction-tuning dataset in the field of remote sensing. The Falcon_SFT dataset consists of approximately 78 million high-quality data samples, covering 5.6 million multi-spatial-resolution and multi-view remote sensing images with diverse instructions. It features hierarchical annotations and undergoes manual sampling verification to ensure high data quality and reliability. Extensive comparative experiments verify that Falcon achieves remarkable performance over 67 datasets and 14 tasks, despite having only 0.7B parameters. We release the complete dataset, code, and model weights at https://github.com/TianHuiLab/Falcon, hoping to help further develop the open-source community.
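
To make the idea of casting a task as a sequence-to-sequence problem with varied instructions more concrete, here is a minimal Python sketch for a detection-style task. Everything in it is an assumption made for illustration: the prompt paraphrases, the `<loc_N>` token format, the 1000-bin coordinate quantization, and the helper names `box_to_text` and `make_example` are hypothetical and are not taken from the Falcon code or the Falcon_SFT dataset.

```python
import random

# Hypothetical paraphrases of one instruction; NOT Falcon's actual templates.
DETECTION_PROMPTS = [
    "Detect all {label} in the image.",
    "Where are the {label} in this image?",
    "Find every {label} and give its bounding box.",
]

NUM_BINS = 1000  # assumed coordinate quantization resolution


def box_to_text(box, width, height, num_bins=NUM_BINS):
    """Serialize a pixel-space box (x0, y0, x1, y1) as discrete location tokens."""
    x0, y0, x1, y1 = box
    # Normalize each coordinate to [0, 1] by the image size.
    scaled = [x0 / width, y0 / height, x1 / width, y1 / height]
    # Map to an integer bin and clamp to the valid range [0, num_bins - 1].
    bins = [min(num_bins - 1, max(0, int(v * num_bins))) for v in scaled]
    return "".join(f"<loc_{b}>" for b in bins)


def make_example(label, boxes, width, height, rng=random):
    """Build one (prompt, target) text pair for a detection-style task."""
    # Sampling a paraphrase per example varies the instruction wording.
    prompt = rng.choice(DETECTION_PROMPTS).format(label=label)
    # The target is plain text: the label followed by its location tokens.
    target = "".join(label + box_to_text(b, width, height) for b in boxes)
    return prompt, target
```

For example, `make_example("plane", [(0, 0, 512, 512)], 1024, 1024)` returns a randomly chosen prompt together with the target string `plane<loc_0><loc_0><loc_500><loc_500>`. Because both input and output are text, the same encoder-decoder interface could in principle serve classification, captioning, and region-level tasks alike; how Falcon actually encodes each task is specified in the paper and repository, not here.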

Paper Structure

This paper contains 38 sections, 13 equations, 29 figures, 24 tables.

Figures (29)

  • Figure 1: An overall performance comparison between Falcon and 10 state-of-the-art models across 14 remote sensing tasks at image, region, and pixel levels. Results demonstrate that Falcon outperformed existing models, showcasing superior and more comprehensive understanding and reasoning capabilities.
  • Figure 2: Overview of the Falcon model architecture. Given a single image or an image pair (for the task of change detection), Falcon can follow diverse multi-task instructions, generating a universal textual representation suitable for various remote sensing tasks. As shown in the figure, Falcon correctly distinguishes the category of the given image, provides the spatial bounding boxes/segmentation masks for the given objects, and even detects subtle changes across images, highlighting its comprehensive capabilities for remote sensing.
  • Figure 3: An illustrative example of images, their corresponding instructions, and the output formats of different tasks in the Falcon_SFT dataset.
  • Figure 4: Visualization of Falcon's output on the tasks of object detection, visual grounding, segmentation, and change detection.
  • Figure I: Overview of the qualitative results of Falcon in 14 tasks.
  • ...and 24 more figures