Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection

Chuang Peng, Renshuai Tao, Zhongwei Ren, Xianglong Liu, Yunchao Wei

TL;DR

This work proposes DualXrayBench, the first benchmark for X-ray prohibited-item detection that integrates dual-view imagery with multimodal data to test cross-view reasoning. It introduces GSXray, a Chain-of-Thought–guided dataset that provides structured supervision for learning cross-view geometry and cross-modal semantics, enabling a second view to function as a language-like modality. The Geometric-Semantic Reasoner (GSR) fuses a dual-view vision encoder with a language reasoning module and a language-like side input using hierarchical tokens, achieving state-of-the-art results across eight cross-view tasks on DualXrayBench. Together, these resources demonstrate that leveraging a second view through structured cross-view reasoning significantly improves robustness and open-set generalization in X-ray prohibited-item detection, with practical implications for security inspections. The approach offers a new direction for multi-view vision-language models in safety-critical domains, where cross-perspective constraints can resolve occlusions and ambiguities more effectively than single-view methods.

Abstract

Automatic detection of prohibited items in X-ray images is vital for security inspection and has been widely studied. Traditional methods rely on the visual modality and often struggle with complex threats. While recent studies incorporate language to guide single-view images, human inspectors typically use dual-view images in practice. This raises the question: can the second view provide constraints similar to a language modality? In this work, we introduce DualXrayBench, the first comprehensive benchmark for X-ray inspection that includes multiple views and modalities. It supports eight tasks designed to test cross-view reasoning. As part of DualXrayBench, we introduce a caption corpus of 45,613 dual-view image pairs across 12 categories with corresponding captions. Building on these data, we propose the Geometric (cross-view)-Semantic (cross-modality) Reasoner (GSR), a multimodal model that jointly learns correspondences between cross-view geometry and cross-modal semantics, treating the second-view images as a "language-like modality". To enable this, we construct the GSXray dataset, with structured Chain-of-Thought sequences: <top>, <side>, <conclusion>. Comprehensive evaluations on DualXrayBench demonstrate that GSR achieves significant improvements across all X-ray tasks, offering a new perspective for real-world X-ray inspection.
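
The abstract names three Chain-of-Thought stages, <top>, <side>, and <conclusion>, but does not specify how they are serialized. As a rough illustration only, the sketch below assumes each stage is wrapped in paired tags (for example <top>...</top>) and appears in that order; the `CoTSequence` class, the `parse_cot` function, and the sample text are hypothetical and are not taken from the GSXray release.

```python
import re
from dataclasses import dataclass

# Assumption: stages are wrapped in paired tags and appear in this order.
# The actual GSXray serialization may differ.
STAGES = ("top", "side", "conclusion")


@dataclass
class CoTSequence:
    """Hypothetical container for one structured reasoning sequence."""
    top: str
    side: str
    conclusion: str


def parse_cot(text: str) -> CoTSequence:
    """Split a tagged reasoning string into its three stages.

    Raises ValueError if a stage is missing or appears out of order.
    """
    parts = {}
    pos = 0
    for stage in STAGES:
        pattern = re.compile(rf"<{stage}>(.*?)</{stage}>", re.DOTALL)
        # Search only after the previous stage to enforce ordering.
        match = pattern.search(text, pos)
        if match is None:
            raise ValueError(f"missing or out-of-order <{stage}> block")
        parts[stage] = match.group(1).strip()
        pos = match.end()
    return CoTSequence(**parts)


# Made-up sample text, for illustration only.
example = (
    "<top>An elongated dense object lies near the center of the bag.</top>"
    "<side>The same object shows a thin profile consistent with a blade.</side>"
    "<conclusion>knife</conclusion>"
)
sequence = parse_cot(example)
print(sequence.conclusion)
```

Searching for each stage only after the previous one ends means a sequence that skips the top-view or side-view step is rejected rather than silently accepted, which is one simple way to check that a conclusion is preceded by evidence from both views.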

Paper Structure

This paper contains 15 sections, 3 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Comparison of existing X-ray prohibited-item detection strategies (a–c) with our DualXrayBench (d), which treats the second-view image as a language-like modality that provides additional constraints to enhance detection.
  • Figure 2: Construction pipeline of the DualXrayBench caption corpus, illustrating metadata preprocessing and extraction, structured prompt design, LLM-based controlled caption generation, automated quality filtering, and expert verification to ensure high-quality captions for X-ray prohibited-item detection and reasoning.
  • Figure 3: Representative examples from DualXrayBench illustrating eight diagnostic tasks. Each example pairs top- and side-view images with annotated evidence and corresponding questions. Tasks encompass perception (counting, recognition), relational reasoning (spatial alignment), occlusion inference (visibility and containment), and attribute understanding (posture, geometry), highlighting the benchmark's coverage of cross-perspective reasoning challenges.
  • Figure 4: Overall architecture of the DualXrayBench–GSR framework: (1) A structured data generation pipeline for creating DualXrayBench and GSXray CoT supervision from raw multi-view X-ray metadata; (2) A supervised fine-tuning framework that learns dual-view geometric correspondences and cross-modal semantics, treating side-view imagery as a language-like modality; (3) A unified inference and evaluation protocol for cross-view reasoning, consistency assessment, and spatial understanding across eight diagnostic tasks.