XSeg: A Large-scale X-ray Contraband Segmentation Benchmark For Real-World Security Screening

Hongxia Gao, Litao Li, Yixin Chen, Jiali Wen, Kaijie Zhang, Qianyun Liu

Abstract

X-ray contraband detection is critical for public safety. However, current methods rely primarily on bounding box annotations, which limit model generalization and performance due to the lack of pixel-level supervision and real-world data. To address these limitations, we introduce XSeg. To the best of our knowledge, XSeg is the largest X-ray contraband segmentation dataset to date, comprising 98,644 images and 295,932 instance masks and covering 30 common, up-to-date contraband categories. The images are sourced from public datasets and our own synthesized data, filtered through a custom data cleaning pipeline to remove low-quality samples. To enable accurate, efficient annotation and reduce manual labeling effort, we propose Adaptive Point SAM (APSAM), a specialized mask annotation model built upon the Segment Anything Model (SAM). We address SAM's poor cross-domain generalization and limited ability to detect stacked objects by introducing an Energy-Aware Encoder that enhances the initialization of the mask decoder, significantly improving sensitivity to overlapping items. Additionally, we design an Adaptive Point Generator that allows users to obtain precise mask labels from only a single coarse point prompt. Extensive experiments on XSeg demonstrate the superior performance of APSAM.

Paper Structure

This paper contains 22 sections, 10 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: XSeg data examples. The first row shows X-ray images of contraband from various sources, including but not limited to MobilePhone, Liquid, Scissors, Baton, and Handcuffs. Ground-truth masks are generated by SAM [sam] and refined by multiple security experts.
  • Figure 2: The cross-domain chromatic variance phenomenon. The first row is PIDray [pidray], whose overall color cast is greenish; the second row is PIXray [pixray], whose imaging is likewise greenish; the third row is HiXray [hixray], where the color distribution of the luggage differs significantly from PIDray and PIXray.
  • Figure 3: Pipeline of data cleaning. First, images are filtered by resolution, aspect ratio, and noise level using Laplacian variance thresholds, effectively removing low-resolution, irregularly proportioned, or excessively noisy samples (see the filtering sketch after this list). The cleaned images are then manually annotated by experts to establish high-precision ground-truth masks. These annotations are subsequently used to train an in-house segmentation model, forming an automated labeling system. The model-generated masks undergo multiple rounds of iterative refinement to ensure boundary accuracy, with human verification at each stage. The final output is XSeg.
  • Figure 4: Framework of APSAM. APSAM initializes output tokens by concatenating maximum and minimum grayscale image representations. These are then processed by an Energy-Aware Encoder (EAE) and an MLP-based Location Initializer. For the visual encoder backbone, we fine-tune with an identical adapter, keeping the other parameters frozen. Point prompts are adaptively generated by our APG, and the mask decoder produces the final segmentation.
  • Figure 5: Energy-Aware Encoder and Location Initializer. The Energy-Aware Encoder comprises three convolutional layers with GELU activation and max-pooling. The Location Initializer relies primarily on linear layers and softmax for channel selection, ultimately employing Top-$k$ filtering to identify the optimal token (see the module sketch after this list).
  • ...and 3 more figures
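
To make the first filtering stage of the Figure 3 pipeline concrete, here is a minimal Python sketch assuming OpenCV. The threshold values and the exact direction of the Laplacian-variance criterion are illustrative assumptions; the paper describes the criteria (resolution, aspect ratio, noise) but this excerpt does not publish its thresholds.

```python
import cv2

# Illustrative thresholds (assumptions; not values published by the paper).
MIN_SIDE = 256        # reject low-resolution images
MAX_ASPECT = 3.0      # reject irregularly proportioned images
MAX_LAP_VAR = 1500.0  # reject excessively noisy images (high Laplacian variance)

def passes_cleaning(path: str) -> bool:
    """Return True if an image survives the first stage of the Figure 3 pipeline."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:                         # unreadable or corrupt file
        return False
    h, w = img.shape
    if min(h, w) < MIN_SIDE:                # resolution filter
        return False
    if max(h, w) / min(h, w) > MAX_ASPECT:  # aspect-ratio filter
        return False
    # Variance of the Laplacian as a noise proxy, as in the caption.
    lap_var = cv2.Laplacian(img, cv2.CV_64F).var()
    return lap_var <= MAX_LAP_VAR
```

Images that pass this gate would then proceed to expert annotation and the iterative model-assisted refinement loop described in the caption.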
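The Figure 4 and 5 captions pin down the building blocks of APSAM's token initialization: a concatenated max/min grayscale input, an Energy-Aware Encoder of three convolutional layers with GELU and max-pooling, and a Location Initializer built from linear layers, softmax, and Top-$k$ filtering. Below is a minimal PyTorch sketch under exactly those constraints; the channel widths, kernel sizes, value of $k$, and the precise scoring wiring are assumptions, since this excerpt does not specify them.

```python
import torch
import torch.nn as nn

class EnergyAwareEncoder(nn.Module):
    """Three conv blocks with GELU activation and max-pooling (Figure 5).
    Channel widths and kernel sizes are illustrative assumptions."""
    def __init__(self, in_ch: int = 2, dim: int = 256):
        super().__init__()
        chs = [in_ch, 64, 128, dim]
        blocks = []
        for c_in, c_out in zip(chs[:-1], chs[1:]):
            blocks += [nn.Conv2d(c_in, c_out, 3, padding=1),
                       nn.GELU(),
                       nn.MaxPool2d(2)]
        self.net = nn.Sequential(*blocks)

    def forward(self, x):
        # x: (B, 2, H, W) -- concatenated max/min grayscale pair (Figure 4)
        return self.net(x)  # (B, dim, H/8, W/8)

class LocationInitializer(nn.Module):
    """Linear layers + softmax scoring + Top-k filtering (Figure 5)."""
    def __init__(self, dim: int = 256, k: int = 4):
        super().__init__()
        self.k = k
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                   nn.Linear(dim, 1))

    def forward(self, feats):
        # feats: (B, dim, h, w) from the Energy-Aware Encoder
        B, C, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)                 # (B, h*w, C)
        weights = self.score(tokens).squeeze(-1).softmax(dim=-1)  # (B, h*w)
        topk = weights.topk(self.k, dim=-1).indices               # (B, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, C)
        return tokens.gather(1, idx)                              # (B, k, C)
```

Per the Figure 4 description, the selected tokens would serve to initialize the mask decoder's output tokens, which is plausibly what improves sensitivity to stacked, overlapping items; how those tokens are injected into the frozen SAM decoder is not specified in this excerpt.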