Superpowering Open-Vocabulary Object Detectors for X-ray Vision

Pablo Garcia-Fernandez, Lorenzo Vaquero, Mingxuan Liu, Feng Xue, Daniel Cores, Nicu Sebe, Manuel Mucientes, Elisa Ricci

TL;DR

RAXO is a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. The paper also introduces DET-COMPASS, a new benchmark with bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray.

Abstract

Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, providing an average mAP increase of up to 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray. Code and dataset available at: https://github.com/PAGF188/RAXO.

Paper Structure

This paper contains 29 sections, 6 equations, 10 figures, 12 tables, 1 algorithm.

Figures (10)

  • Figure 1: (a) Traditional X-ray object detectors are constrained by the limited categories in their training datasets. (b) We introduce the task of open-vocabulary object detection (OvOD) for X-ray imaging and propose RAXO, a training-free method that adapts off-the-shelf RGB OvOD models to X-ray data. (c) RAXO greatly improves detection performance across multiple benchmarks.
  • Figure 2: Architecture of RAXO. For a given user-defined class $c \in \mathcal{C}^{\mathrm{test}}$, RAXO first retrieves its corresponding X-ray images $\mathcal{G}_{c}^{\mathrm{XRAY}}$ from in-house and web sources, using its (1) Visual Samples Acquisition pipeline (\ref{sec:vea}). Following this, RAXO extracts the features of the images and segments them with its (2) Class Descriptor Modeling module (\ref{sec:vcm}), creating ensemble visual descriptors for the class $\mathcal{X}_c$ and the background $\mathcal{X}_\text{bg}$. Finally, the text-based classifier from the baseline RGB OvOD detector is replaced with our (3) Visual-based Classifier $\mathcal{X}$ (\ref{sec:rec}), which yields accurate predictions on the X-ray modality.
  • Figure 3: Web-powered retrieval and material-transfer mechanism for the class "violin". We retrieve violin samples from the web, filter them using $\mathcal{F}_{\text{RGB}}$, and inpaint the retrieved appearance into the object masks to generate synthetic X-ray samples.
  • Figure 4: Impact of $\bm{K}$ in class representations, evaluated on PIXray \cite{pixray} using a G-DINO \cite{liu2025grounding} baseline in the 100/0 setting.
  • Figure 5: Impact of $\bm{\sigma}$ in the Descriptor Consistency Module, evaluated on PIXray \cite{pixray} using a G-DINO \cite{liu2025grounding} baseline in the 100/0 setting.
  • ...and 5 more figures
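
The Figure 2 caption describes replacing a text-based classifier with visual class descriptors $\mathcal{X}_c$ and a background descriptor $\mathcal{X}_\text{bg}$. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes each descriptor is the normalized mean of a class's gallery features and that a region is assigned to the descriptor with the highest cosine similarity. The function names, the `BACKGROUND` key, and the toy 3-dimensional features are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
BACKGROUND = "__background__"  # hypothetical key for the background descriptor


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def mean_descriptor(features: Sequence[Sequence[float]]) -> Vector:
    """Average L2-normalized features into a single unit-norm descriptor."""
    if not features:
        raise ValueError("need at least one feature vector")
    normalized = [l2_normalize(f) for f in features]
    dim = len(normalized[0])
    mean = [sum(f[i] for f in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def build_descriptors(
    gallery: Dict[str, List[Vector]], background: List[Vector]
) -> Dict[str, Vector]:
    """Build one descriptor per class plus one for the background.

    Assumption: a plain mean stands in for whatever ensemble the paper uses.
    """
    if BACKGROUND in gallery:
        raise ValueError("class name collides with the background key")
    descriptors = {name: mean_descriptor(feats) for name, feats in gallery.items()}
    descriptors[BACKGROUND] = mean_descriptor(background)
    return descriptors


def classify(
    region_feature: Sequence[float], descriptors: Dict[str, Vector]
) -> Tuple[str, Dict[str, float]]:
    """Return the best label and all cosine similarities for one region."""
    query = l2_normalize(region_feature)
    scores = {
        name: sum(q * d for q, d in zip(query, desc))
        for name, desc in descriptors.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy features standing in for detector embeddings of X-ray gallery crops.
gallery = {
    "violin": [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],
    "knife": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
background = [[0.0, 0.1, 1.0]]

descriptors = build_descriptors(gallery, background)
label, scores = classify([0.8, 0.2, 0.1], descriptors)
print(label, scores)
```

Because both the query and the descriptors come from the same visual modality, the comparison is intra-modal, which is the property the abstract credits for robustness. A region that scores highest against the background descriptor would simply be discarded in this sketch.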