STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection

Divya Velayudhan, Abdelfatah Ahmed, Mohamad Alansari, Neha Gour, Abderaouf Behouch, Taimur Hassan, Syed Talal Wasim, Nabil Maalej, Muzammal Naseer, Juergen Gall, Mohammed Bennamoun, Ernesto Damiani, Naoufel Werghi

TL;DR

This work addresses the shortage of real-world X-ray baggage security data by introducing STCray, a 46,642-image multimodal X-ray dataset with detailed captions generated via the STING protocol. Building on STCray, the authors present STING-BEE, a domain-aware vision-language model that unifies scene comprehension, threat localization, visual grounding, and VQA for baggage security. Through multi-task instruction tuning and CT-2-Xray augmentations, STING-BEE achieves strong cross-domain generalization across SIXray, PIDray, and COMPASS-XP, outperforming general-purpose VLMs on multiple metrics. The contributions advance practical threat detection by enabling domain-specific, language-driven perception in highly cluttered, cross-vendor X-ray scans, with broad implications for operational Computer-Aided Screening (CAS) systems.

Abstract

Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans. However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. STCray is meticulously developed with our specialized protocol, which ensures domain-aware, coherent captions that in turn yield multi-modal instruction-following data for X-ray baggage security. This allows us to train a domain-aware visual AI assistant named STING-BEE that supports a range of vision-language tasks, including scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing novel baselines for multi-modal learning in X-ray baggage security. Further, STING-BEE shows state-of-the-art generalization in cross-domain settings. Code, data, and models are available at https://divs1159.github.io/STING-BEE/.

Paper Structure

This paper contains 28 sections, 2 equations, 27 figures, 10 tables, 1 algorithm.

Figures (27)

  • Figure 1: (Top) A holistic overview of our STCray, the X-ray dataset with real-world threats and image-text paired data; (Bottom) a comparison with public datasets in terms of multi-modality, strategic concealment, emerging novel threats, and zero-shot tasks.
  • Figure 2: Each row compares GPT-4, Gemini-1.5 Pro, and LLaVA-NeXT captions for the input image from the STCray dataset (first column) with the captions generated by our STING protocol (last column). Parts of the captions highlighted in green are correct, those in red are wrong, and those in blue are ambiguous. GPT-4 and Gemini-1.5 Pro fail to identify any threat items in either scan, e.g., the blade (1st-row image) and the 3D-printed gun (2nd-row image), while LLaVA-NeXT interprets the baggage X-ray images as medical scans (best viewed in zoom).
  • Figure 3: Sample images from our STCray with instance-level annotations and corresponding captions (best viewed in zoom).
  • Figure 4: STING protocol systematically generates captions for X-ray baggage images by selecting the baggage type and threat category (e.g., pliers). It then specifies item location and pose (e.g., corner flat), followed by levels of concealment and clutter, with varying occluding objects and degrees of occlusion and additional normal items. This approach models realistic concealment scenarios, providing detailed information on item location, orientation, and surrounding context to support precise caption generation for each scan.
  • Figure 5: Instance-wise distribution of threat categories in the STCray dataset. Left: Radial plot depicting overall counts; Right: Table summary across train and test sets.
  • ...and 22 more figures