Taming Generative Synthetic Data for X-ray Prohibited Item Detection

Jialong Sun, Hongguang Zhu, Weizhe Liu, Yunda Sun, Renshuai Tao, Yunchao Wei

TL;DR

The paper tackles the data bottleneck in X-ray prohibited item detection by proposing Xsyn, a one-stage synthesis pipeline that uses text-grounded inpainting to generate high-quality synthetic X-ray images without labor-intensive foreground extraction. It introduces two strategies: Cross-Attention Refinement (CAR), which automatically refines synthetic annotations using diffusion cross-attention maps and SAM, and Background Occlusion Modeling (BOM), which simulates realistic occlusions in the latent space. Results on PIDray, OPIXray, and HiXray show that synthetic data from Xsyn improves detection performance across multiple detectors, with Xsyn-A achieving the largest gains (e.g., +1.2% mAP on PIDray). The method reduces labeling costs while improving dataset realism and diversity, which benefits the training of prohibited-item detectors in security scenarios.

Abstract

Training prohibited item detection models requires a large number of X-ray security images, but collecting and annotating these images is time-consuming and laborious. To address data insufficiency, X-ray security image synthesis methods composite images to scale up datasets. However, previous methods primarily follow a two-stage pipeline: labor-intensive foreground extraction in the first stage, followed by image compositing in the second stage. Such a pipeline inevitably incurs extra labor cost and is inefficient. In this paper, we propose a one-stage X-ray security image synthesis pipeline (Xsyn) based on text-to-image generation, which incorporates two effective strategies to improve the usability of synthetic images. The Cross-Attention Refinement (CAR) strategy leverages the cross-attention map from the diffusion model to refine the bounding box annotation. The Background Occlusion Modeling (BOM) strategy explicitly models background occlusion in the latent space to enhance imaging complexity. To the best of our knowledge, Xsyn is the first method to achieve high-quality X-ray security image synthesis without extra labor cost. Experiments demonstrate that our method outperforms all previous methods with a 1.2% mAP improvement, and that the synthetic images it generates improve prohibited item detection performance across various X-ray security datasets and detectors. Code is available at https://github.com/pILLOW-1/Xsyn/.
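
As a rough illustration of what "modeling background occlusion in the latent space" could look like, the NumPy sketch below blends two latent tensors region by region with a spatial mask. This is only a sketch under stated assumptions and not the authors' implementation: the function names `recombine_latents` and `box_mask`, the `(C, H, W)` latent layout, and the simple linear mask blend are all hypothetical choices made for illustration.

```python
import numpy as np


def box_mask(height, width, box):
    """Build an (H, W) mask that is 1 inside box and 0 elsewhere.

    box is (y0, x0, y1, x1) with half-open ranges (hypothetical convention).
    """
    y0, x0, y1, x1 = box
    mask = np.zeros((height, width), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    return mask


def recombine_latents(z_item, z_background, occlusion_mask):
    """Regionally recombine two latents of shape (C, H, W).

    Where occlusion_mask is 1 the background latent is used; where it is 0
    the item latent is kept. Values in between blend the two linearly.
    This blend rule is an assumption, not taken from the paper.
    """
    if z_item.shape != z_background.shape:
        raise ValueError("latents must have the same shape")
    if occlusion_mask.shape != z_item.shape[1:]:
        raise ValueError("mask must match the latent's spatial size")
    # Add a channel axis so the mask broadcasts over all channels.
    m = occlusion_mask[None, :, :].astype(z_item.dtype)
    return m * z_background + (1.0 - m) * z_item


if __name__ == "__main__":
    z_item = np.ones((4, 8, 8), dtype=np.float32)
    z_background = np.zeros((4, 8, 8), dtype=np.float32)
    mask = box_mask(8, 8, (2, 2, 5, 5))
    mixed = recombine_latents(z_item, z_background, mask)
    print(mixed[0])
```

In a real diffusion pipeline the recombined latent would still have to be denoised and decoded; those steps are omitted here.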

Paper Structure

This paper contains 14 sections, 9 equations, 9 figures, and 9 tables.

Figures (9)

  • Figure 1: Analysis of existing X-ray security image synthesis methods. Previous two-stage synthesis methods incur unavoidable labor cost in the first stage (e.g., the foreground preparation process), which hinders the efficiency of the whole synthesis pipeline. In contrast, Xsyn is a simple and effective one-stage synthesis pipeline that automatically refines the synthetic annotation and enhances the synthetic complexity, thereby generating high-quality synthetic data and eliminating extra labor costs.
  • Figure 2: Qualitative comparisons between L2I generation and grounded inpainting. The background of the L2I-generated image (middle) differs substantially from real-world baggage (left), which may hinder detection performance. Therefore, we choose grounded inpainting (right) to retain the background.
  • Figure 3: Cross-Attention Refinement. To obtain a spatially aligned annotation, we leverage SAM to locate the generated prohibited item based on the rich class-discriminative spatial localization information in the cross-attention map. Note how the bounding box (blue box) of the generated item is refined.
  • Figure 4: Median Point Sampling. Because the background inside the bounding box may interfere with the refinement, we propose to enhance the localization by recursively sampling median points as foreground points. Different colors refer to different division levels.
  • Figure 5: Background Occlusion Modeling. BOM performs occlusion through regional recombination in the latent space. For simplicity, we omit other variables and components of the diffusion model since the whole generation process has been elaborated.
  • ...and 4 more figures
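
The Figure 4 caption describes sampling median points recursively across division levels, but it does not give the exact procedure. The sketch below is one possible reading, offered only as an illustration: it assumes a binary mask of high-attention pixels (for example, a thresholded cross-attention map), takes the median row and column of those pixels inside a box, splits the box into four quadrants at that point, and repeats. The function name `median_points`, the quadrant split, and the `depth` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def median_points(mask, box, depth):
    """Recursively sample median points of the True pixels inside box.

    mask:  2-D boolean array, True where attention is assumed to be high.
    box:   (y0, x0, y1, x1) with half-open ranges.
    depth: number of further division levels below this one.
    Returns a list of (row, col) tuples, coarsest level first.
    """
    y0, x0, y1, x1 = box
    ys, xs = np.nonzero(mask[y0:y1, x0:x1])
    if depth < 0 or ys.size == 0:
        return []
    # Median of row and column coordinates, taken independently.
    # For non-convex shapes this point may not lie on a True pixel.
    my = y0 + int(np.median(ys))
    mx = x0 + int(np.median(xs))
    points = [(my, mx)]
    if depth == 0:
        return points
    # Split the box at the median point and recurse into each quadrant.
    quadrants = [
        (y0, x0, my, mx),
        (y0, mx, my, x1),
        (my, x0, y1, mx),
        (my, mx, y1, x1),
    ]
    for quadrant in quadrants:
        points.extend(median_points(mask, quadrant, depth - 1))
    return points


if __name__ == "__main__":
    mask = np.zeros((8, 8), dtype=bool)
    mask[2:6, 2:6] = True
    for point in median_points(mask, (0, 0, 8, 8), depth=1):
        print(point)
```

Points gathered this way could be passed as foreground prompts to a promptable segmenter such as SAM, whose predicted mask would then yield the refined bounding box; that step is left out because it depends on the model setup.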