Performance Evaluation of Segment Anything Model with Variational Prompting for Application to Non-Visible Spectrum Imagery

Yona Falinie A. Gaus, Neelanjan Bhowmik, Brian K. S. Isaac-Medina, Toby P. Breckon

TL;DR

This study investigates whether SAM's zero-shot segmentation extends to non-visible spectrum imagery (X-ray and infrared) without retraining. It evaluates three prompting strategies (bounding box, centroid, and random point) across four public datasets (PIDray, CLCXray, DBF6, FLIR). Bounding-box prompts consistently yield better segmentation, while point prompts are dataset-dependent and often degrade performance, which highlights the limits of cross-modal generalization and suggests that modality-specific fine-tuning may be needed. The work underscores SAM's potential to accelerate annotation in non-visible domains, while signaling a practical need for dataset-specific adaptation to achieve robust segmentation in security and surveillance tasks.

Abstract

The Segment Anything Model (SAM) is a deep neural network foundation model for instance segmentation that has gained significant popularity owing to its zero-shot segmentation ability. SAM generates masks from various input prompts such as text, bounding boxes, points, or masks, introducing a novel methodology to overcome the constraints posed by dataset-specific scarcity. Although SAM is trained on an extensive dataset of ~11M images, that dataset consists mostly of natural photographic images, with only very few images from other modalities. Rapid progress in visual infrared surveillance and X-ray security screening imaging technologies, driven by advances in deep learning, has significantly enhanced the ability to detect, classify and segment objects with high accuracy, yet it is not evident whether SAM's zero-shot capabilities transfer to such modalities. This work assesses SAM's capabilities in segmenting objects of interest in the X-ray and infrared modalities. Our approach reuses the pre-trained SAM with three different prompts: bounding box, centroid and random points. We present quantitative and qualitative results to showcase the performance on selected datasets. Our results show that SAM can segment objects in the X-ray modality when given a box prompt, but its performance varies for point prompts. Specifically, SAM performs poorly in segmenting slender objects and organic materials, such as plastic bottles. We find that infrared objects are also challenging to segment with point prompts, given the low-contrast nature of this modality. This study shows that while SAM demonstrates outstanding zero-shot capabilities with box prompts, its performance ranges from moderate to poor for point prompts, indicating that special consideration of the cross-modal generalisation of SAM is needed when considering its use on X-ray and infrared imagery.
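
To make the three prompt types concrete, the following is a minimal sketch of how a bounding box, a centroid and a random point could each be derived from a binary ground-truth mask. It is an illustration under stated assumptions, not the authors' code: the abstract does not say how the prompts were generated, and the helper names (`foreground_pixels`, `bbox_prompt`, `centroid_prompt`, `random_point_prompt`) and the toy mask are hypothetical. Pure Python lists are used so the example has no dependencies.

```python
import random
from typing import List, Tuple

Mask = List[List[int]]  # binary mask indexed as mask[y][x]; 1 marks an object pixel


def foreground_pixels(mask: Mask) -> List[Tuple[int, int]]:
    """Return the (x, y) coordinates of every object pixel, in row-major order."""
    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("mask contains no object pixels")
    return pixels


def bbox_prompt(mask: Mask) -> Tuple[int, int, int, int]:
    """Tightest box around the object as (x_min, y_min, x_max, y_max)."""
    pixels = foreground_pixels(mask)
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)


def centroid_prompt(mask: Mask) -> Tuple[float, float]:
    """Mean (x, y) position of the object pixels."""
    pixels = foreground_pixels(mask)
    n = len(pixels)
    return sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n


def random_point_prompt(mask: Mask, rng: random.Random) -> Tuple[int, int]:
    """One object pixel chosen uniformly at random."""
    return rng.choice(foreground_pixels(mask))


# Hypothetical toy mask: a 3x2 block of object pixels inside a 5x4 image.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

box = bbox_prompt(toy_mask)            # (1, 1, 3, 2)
centroid = centroid_prompt(toy_mask)   # (2.0, 1.5)
point = random_point_prompt(toy_mask, random.Random(0))
```

One geometric point worth keeping in mind with this kind of construction: a random point drawn as above always lies on the object, whereas the centroid of a non-convex mask can fall outside it. A box constrains the full extent of the object, while a single point leaves its extent ambiguous.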

Paper Structure (10 sections, 7 figures, 5 tables)


Figures (7)

  • Figure 1: We propose evaluating three prompting strategies (bounding box - bbox, centroid, random point - randpt) to assess the effectiveness of the Segment Anything Model applied to X-ray and infrared imagery for identifying objects of interest. The bbox prompt yields superior segmentation results, while the other two prompting strategies demonstrate notably higher incorrect/missed predictions.
  • Figure 2: Given an input image, Segment Anything Model (SAM) initiates the process by generating image embeddings via an image encoder. These embeddings are subsequently interactively queried by variational prompts (bounding box, centroid, and random points) in order to generate precise segmentation masks for the objects of interest.
  • Figure 3: Recall performance using variational prompting strategies across different IoU thresholds and IoU type: Bbox.
  • Figure 4: DBF6: Recall performance using variational prompting strategies across different IoU thresholds and IoU types: Bbox (left), Segm (right).
  • Figure 5: IoU distribution for each prompt/dataset pair.
  • ...and 2 more figures
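
The figure captions above refer to recall across IoU thresholds for two IoU types (Bbox and Segm). As a rough sketch of what such a measurement involves, the code below computes mask IoU, box IoU, and the fraction of objects whose IoU meets each threshold. The exact evaluation protocol used in the paper is not given in this excerpt, so the function names, the box convention (continuous `(x_min, y_min, x_max, y_max)` coordinates) and the sample IoU values are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Mask = List[List[int]]               # binary mask indexed as mask[y][x]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of equal size."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_thresholds(
    ious: Sequence[float], thresholds: Sequence[float]
) -> Dict[float, float]:
    """Fraction of ground-truth objects whose prediction reaches each IoU threshold."""
    if not ious:
        raise ValueError("at least one IoU value is required")
    return {t: sum(iou >= t for iou in ious) / len(ious) for t in thresholds}


# Made-up per-object IoU values, purely to exercise the function.
sample_ious = [0.9, 0.6, 0.4, 0.75]
recalls = recall_at_thresholds(sample_ious, [0.5, 0.75])
half_overlap = mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
shifted_boxes = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Sweeping the threshold list over a range of values gives one recall curve per prompt type, which is the kind of comparison the recall figures describe.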