InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition

Yijie Zheng, Weijie Wu, Qingyun Li, Xuehui Wang, Xu Zhou, Aiai Ren, Jun Shen, Long Zhao, Guoqing Li, Xue Yang

TL;DR

This work introduces InstructCDS, a suite of instruction-driven object counting, detection, and segmentation tasks for remote sensing (RS), and EarthInstruct, the first RS benchmark for these tasks, spanning open-vocabulary, open-ended, and open-subclass settings. It presents InstructSAM, a training-free pipeline that combines instruction understanding by a large vision-language model (LVLM), class-agnostic SAM2 mask proposals, and a counting-constrained mask-label matching solver to produce labeled RS objects with near-constant inference time. The approach matches or surpasses specialized baselines in counting and recognition across tasks while avoiding task-specific pre-training and tuning, and it reduces output tokens by $\sim 89\%$ and runtime by $>32\%$ relative to direct generation methods. This framework advances versatile, scalable instruction-driven RS analysis and lays groundwork for extending similar capabilities to natural imagery and broader semantic labeling tasks.

Abstract

Language-guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, which requires models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposals, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the proposed tasks, benchmark, and approach will advance future research in developing versatile object recognition systems.
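The binary integer programming step mentioned in the abstract can be written out as follows. This is a sketch inferred from the abstract and the Figure 3 caption, not the paper's own equation: the symbols $x_{ij}$, $s_{ij}$, $N$, and $M$ are assumed notation, and the handling of the case with fewer masks than counted objects is an assumption about how the counting constraint would need to be relaxed.

```latex
% Assumed notation: N mask proposals, M categories,
% s_{ij} = similarity between mask i and category j,
% num_j  = count of category j reported by the LVLM counter,
% x_{ij} = 1 if mask i is assigned category j, else 0.
\begin{aligned}
\max_{x}\quad & \sum_{i=1}^{N}\sum_{j=1}^{M} s_{ij}\,x_{ij} \\
\text{s.t.}\quad & \sum_{j=1}^{M} x_{ij} \le 1 && \forall i
  \quad\text{(each mask gets at most one category)} \\
& \sum_{i=1}^{N} x_{ij} = num_j && \forall j,
  \ \text{if } N \ge \textstyle\sum_{j} num_j \\
& \sum_{i=1}^{N} x_{ij} \le num_j,\quad
  \sum_{i=1}^{N}\sum_{j=1}^{M} x_{ij} = N && \forall j,
  \ \text{if } N < \textstyle\sum_{j} num_j \\
& x_{ij} \in \{0,1\}
\end{aligned}
```

Because the counts fix how many masks each category receives, no confidence threshold is needed to decide which masks to keep.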

Paper Structure

This paper contains 55 sections, 1 equation, 34 figures, 12 tables.

Figures (34)

  • Figure 3: The InstructSAM framework. Given an input image and a structured counting prompt, the LVLM Counter extracts target categories $\{cat_j\}$ (semantic info) and their counts $\{num_j\}$ (quantitative info). Concurrently, the Mask Proposer generates mask proposals $\{mask_i\}$ (visual info). A CLIP model computes the similarity matrix $S$ between mask embeddings (from scaled crops) $\{v_i\}$ and category embeddings $\{t_j\}$. Finally, the Binary Integer Programming (BIP) Solver optimally assigns categories to masks by maximizing summed similarity, subject to the counting constraints, yielding the final recognition results.
  • Figure 5: Inference time as a function of bounding box count for InstructSAM-Qwen and baseline methods. Solid lines indicate linear regressions, and scatter points represent individual samples. The shaded regions for InstructSAM illustrate the time composition of different processing steps. Experiments are conducted on an RTX 4090 GPU.
  • Figure 6: Threshold sensitivity analysis in the open-vocabulary setting. Left: Impact on mean metrics. Right: Category-specific $\text{F}_\text{1}$-scores. Dashed lines indicate optimal thresholds maximizing $\text{mF}_\text{1}$.
  • Figure 8: Examples illustrating that SkysenseGPT [luo2024skysensegpt] and TEOChat [irvin2024teochat] fail to produce meaningful responses for open-vocabulary, open-ended, and open-subclass prompts. Their responses either lack category name outputs or exhibit looped generation.
  • Figure 9: Qualitative results in the open-vocabulary setting. While OWLv2 struggles to distinguish remote sensing objects beyond vehicles, and SegEarth-OV fails to separate foreground objects from the background, InstructSAM demonstrates superior performance in segmenting remote sensing objects.
  • ...and 29 more figures
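
The matching step described in the Figure 3 caption (assign categories to masks by maximizing summed similarity subject to the counting constraints) can be illustrated with a small sketch. This is not the authors' implementation: the function name `assign_labels` is hypothetical, the similarity values are made up, and the exhaustive search stands in for a real integer programming solver, so it only suits tiny inputs. The rule used when there are fewer masks than counted objects (label every mask, never exceed a category's count) is likewise an assumption.

```python
from itertools import product


def assign_labels(sim, counts):
    """Solve a tiny counting-constrained mask-label assignment by exhaustive search.

    sim[i][j] is the similarity between mask i and category j.
    counts[j] is the number of objects of category j reported by the counter.
    Returns one entry per mask: a category index, or None if the mask is dropped.
    """
    n_masks, n_cats = len(sim), len(counts)
    enough_masks = n_masks >= sum(counts)
    best_score, best = float("-inf"), None

    # Each mask is either unassigned (None) or given exactly one category.
    for labels in product([None, *range(n_cats)], repeat=n_masks):
        used = [labels.count(j) for j in range(n_cats)]
        if enough_masks:
            # Every category must receive exactly its counted number of masks.
            feasible = used == list(counts)
        else:
            # Assumed relaxation: label every mask, never exceed a count.
            feasible = None not in labels and all(
                u <= c for u, c in zip(used, counts)
            )
        if not feasible:
            continue
        score = sum(sim[i][j] for i, j in enumerate(labels) if j is not None)
        if score > best_score:
            best_score, best = score, list(labels)
    return best


# Four mask proposals, two categories; the counter reports 2 and 1 objects.
sim = [
    [0.90, 0.10],
    [0.80, 0.30],
    [0.20, 0.70],
    [0.40, 0.35],
]
print(assign_labels(sim, [2, 1]))  # [0, 0, 1, None]
```

The fourth mask is discarded even though its similarities are not negligible: the counts, rather than a score threshold, decide how many masks each category keeps.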