Open-Vocabulary X-ray Prohibited Item Detection via Fine-tuning CLIP

Shuyang Lin, Tong Jia, Hao Wang, Bowen Ma, Mingyuan Li, Dongyue Chen

TL;DR

Open-vocabulary X-ray prohibited item detection is challenged by domain shift between X-ray images and CLIP pretraining. The authors introduce OVXD, which fine-tunes CLIP with a three-component X-ray feature adapter (XSA, XAA, XIA) within an OVOD framework to detect novel categories beyond trained base classes. Across PIXray and PIDray, OVXD significantly improves novel-category AP over baselines, validating the adapter design and the open-vocabulary approach, with notable gains in AP50 and AP25 for novel items. The method also demonstrates transferability across datasets and robust ablations, signaling practical potential for real-world security screening with scalable category expansion.

Abstract

X-ray prohibited item detection is an essential component of security checks, and the categories of prohibited items keep growing as laws are updated. Previous works all focus on closed-set scenarios: they can only recognize the known categories used for training and often require time-consuming, labor-intensive annotation to learn novel categories, which limits their real-world application. Although the success of vision-language models (e.g., CLIP) provides a new perspective for open-set X-ray prohibited item detection, directly applying CLIP to the X-ray domain leads to a sharp performance drop due to the domain shift between X-ray data and the general data used to pre-train CLIP. To address these challenges, we introduce the distillation-based open-vocabulary object detection (OVOD) task into the X-ray security inspection domain by extending CLIP to learn visual representations in the X-ray domain, aiming to detect novel prohibited item categories beyond the base categories on which the detector is trained. Specifically, we propose an X-ray feature adapter and apply it to CLIP within the OVOD framework to develop the OVXD model. The X-ray feature adapter contains three adapter submodules with a bottleneck architecture; it is simple yet efficiently integrates new X-ray domain knowledge with the original knowledge, further bridging the domain gap and promoting alignment between X-ray images and textual concepts. Extensive experiments on the PIXray and PIDray datasets demonstrate that the proposed method performs favorably against other baseline OVOD methods in detecting novel categories in the X-ray scenario. It outperforms the previous best results by 15.2 AP50 on PIXray and 1.5 AP50 on PIDray, achieving 21.0 AP50 and 27.8 AP50, respectively.

Paper Structure (26 sections, 9 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: (a) Standard X-ray prohibited item detection requires annotations for all categories during training. (b) Few-shot X-ray prohibited item detection aims to scale the object detector to recognize more categories with only a few samples of the novel categories. (c) Open-vocabulary X-ray prohibited item detection trains on annotated base categories and extends the detector to cover unlabeled novel categories that are unknown during the training phase.
  • Figure 2: Comparing CLIP classification results on common natural-scene data and on X-ray security inspection data reveals a severe domain shift that causes CLIP to fail.
  • Figure 3: Architecture of a vanilla ViT block.
  • Figure 4: The three core submodules of the X-ray feature adapter used in the paper. Each adapter submodule has a bottleneck architecture consisting of a down-projection linear layer, a hidden linear layer, and an up-projection linear layer. (a) X-ray Space Adapter. (b) X-ray Aggregation Adapter. (c) X-ray Image Adapter. s denotes the scale factor.
  • Figure 5: An overview of the OVXD model. OVXD consists of a text branch and an image branch, and the open-vocabulary detector is a Faster R-CNN whose classifier is replaced by a linear layer that maps region features into the word embedding space. OVXD is implemented by applying the X-ray feature adapter to CLIP within the OVOD framework. The three core adapter submodules of the X-ray feature adapter are applied at different positions of the ViT blocks in the text and image encoders of CLIP to integrate domain-specific knowledge with the original knowledge.
  • ...and 2 more figures
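
The bottleneck layout described in the Figure 4 caption (down-projection, hidden layer, up-projection, scale factor s) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the class name `BottleneckAdapter`, the ReLU activations, the residual connection, and the zero-initialised up-projection are all assumptions chosen for illustration, and the sketch does not distinguish between the three submodules (X-ray Space, Aggregation, and Image Adapter), whose exact placement inside the ViT block is not given in the captions.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise ReLU (the choice of activation is an assumption)."""
    return [max(0.0, v) for v in x]


def init_linear(n_in, n_out, rng, zero=False):
    """Return (weight, bias); zero=True gives an all-zero layer."""
    if zero:
        weight = [[0.0] * n_in for _ in range(n_out)]
    else:
        bound = 1.0 / math.sqrt(n_in)
        weight = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                  for _ in range(n_out)]
    return weight, [0.0] * n_out


class BottleneckAdapter:
    """Hypothetical adapter: down-projection -> hidden -> up-projection, scaled by s."""

    def __init__(self, dim, bottleneck, scale=0.5, seed=0):
        rng = random.Random(seed)
        self.down = init_linear(dim, bottleneck, rng)
        self.hidden = init_linear(bottleneck, bottleneck, rng)
        # Assumption: a zero-initialised up-projection, so the adapter starts
        # as an identity mapping and leaves the pretrained features untouched.
        self.up = init_linear(bottleneck, dim, rng, zero=True)
        self.scale = scale  # the scale factor s from the Figure 4 caption

    def __call__(self, x):
        h = relu(linear(x, *self.down))
        h = relu(linear(h, *self.hidden))
        delta = linear(h, *self.up)
        # Assumption: the scaled adapter output is added back to the input.
        return [xi + self.scale * di for xi, di in zip(x, delta)]


adapter = BottleneckAdapter(dim=8, bottleneck=2, scale=0.5)
features = [float(i) for i in range(8)]
adapted = adapter(features)
print(adapted == features)  # True while the up-projection is still all zeros
```

With a residual form like this, only the small adapter weights would need training while the CLIP backbone stays frozen, which is one common way a bottleneck adapter can add domain knowledge without overwriting the original representation. Whether OVXD freezes the backbone in this way is not stated in the text above.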