Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection
Chuang Peng, Renshuai Tao, Zhongwei Ren, Xianglong Liu, Yunchao Wei
TL;DR
This work proposes DualXrayBench, the first benchmark for X-ray prohibited-item detection that integrates dual-view imagery with multimodal data to test cross-view reasoning. It introduces GSXray, a Chain-of-Thought–guided dataset that provides structured supervision for learning cross-view geometry and cross-modal semantics, enabling a second view to function as a language-like modality. The Geometric-Semantic Reasoner (GSR) fuses a dual-view vision encoder with a language reasoning module and a language-like side input using hierarchical tokens, achieving state-of-the-art results across eight cross-view tasks on DualXrayBench. Together, these resources demonstrate that leveraging a second view through structured cross-view reasoning significantly improves robustness and open-set generalization in X-ray prohibited-item detection, with practical implications for security inspections. The approach offers a new direction for multi-view vision-language models in safety-critical domains, where cross-perspective constraints can resolve occlusions and ambiguities more effectively than single-view methods.
Abstract
Automatic X-ray prohibited-item detection is vital for security inspection and has been widely studied. Traditional methods rely on the visual modality alone and often struggle with complex threats. While recent studies incorporate language to guide single-view detection, human inspectors typically use dual-view images in practice. This raises the question: can the second view provide constraints similar to a language modality? In this work, we introduce DualXrayBench, the first comprehensive benchmark for X-ray inspection that includes multiple views and modalities. It supports eight tasks designed to test cross-view reasoning. DualXrayBench also includes a caption corpus of 45,613 dual-view image pairs across 12 categories, each with corresponding captions. Building upon these data, we propose the Geometric (cross-view)-Semantic (cross-modality) Reasoner (GSR), a multimodal model that jointly learns correspondences between cross-view geometry and cross-modal semantics, treating the second-view image as a "language-like modality". To enable this, we construct the GSXray dataset, with structured Chain-of-Thought sequences: <top>, <side>, <conclusion>. Comprehensive evaluations on DualXrayBench demonstrate that GSR achieves significant improvements across all X-ray tasks, offering a new perspective for real-world X-ray inspection.
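The abstract names three Chain-of-Thought segments (<top>, <side>, <conclusion>) but does not specify how they are serialized. As a minimal sketch, assuming each segment is wrapped in a matching open/close tag pair, a sequence could be split into its parts as follows. The names `CoTSample` and `parse_cot` and the example text are hypothetical illustrations, not taken from GSXray.

```python
import re
from dataclasses import dataclass

# Segment names taken from the abstract; the <tag>...</tag> wrapping is an assumption.
TAGS = ("top", "side", "conclusion")


@dataclass
class CoTSample:
    top: str         # reasoning grounded in the top-view image
    side: str        # reasoning grounded in the side-view image
    conclusion: str  # final answer that combines both views


def parse_cot(text: str) -> CoTSample:
    """Extract the three segments from one serialized reasoning sequence."""
    fields = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"missing <{tag}> segment")
        fields[tag] = match.group(1).strip()
    return CoTSample(**fields)


# Invented example text, for illustration only.
example = (
    "<top>A dark elongated shape lies near the centre of the bag.</top>"
    "<side>The same object appears thin with a tapered edge.</side>"
    "<conclusion>Consistent with a knife.</conclusion>"
)
sample = parse_cot(example)
print(sample.conclusion)
```

Keeping the per-view segments separate from the conclusion makes it straightforward to check that each view contributes its own evidence before the final answer is read.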
