Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery

Wei Zhang, Miaoxin Cai, Tong Zhang, Guoqiang Lei, Yin Zhuang, Xuerui Mao

TL;DR

This work tackles the challenge of unified, multi-source ship detection in remote sensing by introducing Popeye, a unified visual-language model that handles HBB, OBB, and pixel-level segmentation through multi-turn language instructions. It combines a hybrid experts encoder for robust multi-scale visual perception, visual-language alignment via LoRA-based fine-tuning on a frozen LLaMA backbone, and an instruction adaption mechanism to transfer natural-scene knowledge to the RS domain, with SAM integrated for segmentation without extra training. The authors construct MMShip, an 81k-instruction dataset that unifies optical and SAR ship data into an image-instruction-answer format and enables cross-domain training. Experiments show Popeye achieves strong zero-shot performance on multiple ship-detection tasks, outperforming specialist detectors and other VLMs, and demonstrates effective pixel-level segmentation when guided by language prompts. This work paves the way for interactive, language-driven maritime monitoring and sets a foundation for broader RS VLM capabilities across multi-source imagery.
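
For readers unfamiliar with LoRA, the standard low-rank adaptation update can be written as follows. This is the generic formulation, offered as background under the assumption that Popeye uses it in its usual form; the rank $r$, the scaling factor $\alpha$, and the choice of adapted layers are not stated in this summary.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Here $W_0$ is a frozen pre-trained weight matrix and only the small matrices $A$ and $B$ are trained, which is what lets the LLaMA backbone stay frozen while the model adapts to new instructions.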

Abstract

Ship detection aims to identify ship locations in remote sensing (RS) scenes. Because of different imaging payloads, the varied appearance of ships, and complicated background interference from the bird's-eye view, it is difficult to establish a unified paradigm for multi-source ship detection. To address this challenge, this article leverages the powerful generalization ability of large language models (LLMs) and proposes a unified visual-language model called Popeye for multi-source ship detection from RS imagery. Specifically, to bridge the interpretation gap between multi-source images for ship detection, a novel unified labeling paradigm is designed to integrate different visual modalities and the various ship detection formats, i.e., horizontal bounding box (HBB) and oriented bounding box (OBB). Subsequently, a hybrid experts encoder is designed to refine multi-scale visual features, thereby enhancing visual perception. Then, a visual-language alignment method is developed for Popeye to enhance interactive comprehension between visual and language content. Furthermore, an instruction adaption mechanism is proposed for transferring pre-trained visual-language knowledge from natural scenes into the RS domain for multi-source ship detection. In addition, the segment anything model (SAM) is seamlessly integrated into Popeye to achieve pixel-level ship segmentation without additional training costs. Finally, extensive experiments are conducted on the newly constructed ship instruction dataset named MMShip, and the results indicate that Popeye outperforms current specialist, open-vocabulary, and other visual-language models for zero-shot multi-source ship detection.
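
The relationship between the two box formats named in the abstract can be illustrated with a short sketch. It assumes an OBB given as center, width, height, and rotation angle in degrees; that parameterization and the function names are illustrative assumptions, not the representation used in the paper.

```python
import math


def obb_to_corners(cx, cy, w, h, angle_deg):
    """Return the four corner points of an oriented box.

    Assumed (hypothetical) OBB encoding: center (cx, cy), size (w, h),
    rotation angle_deg about the center.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_offsets = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    # Rotate each corner offset, then translate it to the box center.
    return [
        (cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
        for dx, dy in half_offsets
    ]


def corners_to_hbb(corners):
    """Return the axis-aligned box (x_min, y_min, x_max, y_max) enclosing the corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    corners = obb_to_corners(50, 50, 40, 10, 30)
    print(corners_to_hbb(corners))
```

An HBB is always recoverable from an OBB in this way, while the reverse is not true, which is one reason a labeling scheme that covers both formats has to store them explicitly.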

Paper Structure

This paper contains 19 sections, 14 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Examples of multi-source (optical/SAR) ship image interpretation by the proposed Popeye in the multi-turn dialogue, including ship detection via OBB or HBB, segmentation, as well as image captioning.
  • Figure 2: (a) Overview of the proposed Popeye. (b) Enhanced visual perception: refining robust multi-scale visual features. (c) Visual-language alignment stage: realizing fundamental visual understanding and image-text mutual interaction. (d) Instruction adaption mechanism: achieving instruction-following ability in the ship domain.
  • Figure 3: Integration with SAM and examples of language-referred pixel-level segmentation.
  • Figure 4: Examples of ship interpretation by Popeye on more challenging SAR and optical RS imagery from the ShipRSImageNet and DSSDD datasets. From left to right: results of Popeye for OBB detection, HBB detection, and ship instance segmentation of small and blurred ship targets.
  • Figure 5: Construction of the MMShip dataset following the unified labeling paradigm: existing multi-source ship detection data are transformed and annotated into a uniform image-instruction-answer format.
  • ...and 1 more figure
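
The image-instruction-answer format described for Figure 5 can be sketched as a small record builder. The field names, the instruction wording, and the answer string layout below are all hypothetical, chosen only to show the shape of such a record; the actual MMShip schema is not given in this summary.

```python
import json


def make_record(image_path, source, box_type, boxes):
    """Build one hypothetical image-instruction-answer record.

    source: imaging modality, e.g. "optical" or "SAR".
    box_type: "HBB" or "OBB".
    boxes: list of coordinate tuples, one per ship.
    """
    if box_type not in ("HBB", "OBB"):
        raise ValueError(f"unsupported box type: {box_type}")
    instruction = (
        f"Detect all ships in the {source} image and output {box_type} coordinates."
    )
    # Serialize each box as text so a language model can emit it token by token.
    answer = "; ".join("ship" + str([round(v) for v in box]) for box in boxes)
    return {"image": image_path, "instruction": instruction, "answer": answer}


if __name__ == "__main__":
    record = make_record("example_sar.png", "SAR", "HBB", [(30, 45, 70, 55)])
    print(json.dumps(record, indent=2))
```

Writing boxes out as plain text is what allows optical and SAR annotations with different box types to share one training format.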