OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis

Tianwei Lin, Zhongwei Qiu, Wenqiao Zhang, Jiang Liu, Yihan Xie, Mingjian Gao, Zhenxuan Fan, Zhaocheng Li, Sijing Li, Zhongle Xie, Peng LU, Yueting Zhuang, Yingda Xia, Ling Zhang, Beng Chin Ooi

TL;DR

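OmniCT is a powerful unified slice-volume LVLM for CT scenarios. It makes three contributions: Spatial Consistency Enhancement, which combines volumetric slice composition with tri-axial positional embedding to introduce volumetric consistency and uses an MoE hybrid projection for efficient slice-volume adaptation; Organ-level Semantic Enhancement, which uses segmentation and ROI localization to align anatomical regions; and MedEval-CT, a slice-volume CT dataset and hybrid benchmark for unified evaluation.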

Abstract

Computed Tomography (CT) is one of the most widely used and diagnostically information-dense imaging modalities, covering critical organs such as the heart, lungs, liver, and colon. Clinical interpretation relies on both slice-driven local features (e.g., sub-centimeter nodules, lesion boundaries) and volume-driven spatial representations (e.g., tumor infiltration, inter-organ anatomical relations). However, existing Large Vision-Language Models (LVLMs) remain fragmented between CT slice and volumetric understanding: slice-driven LVLMs generalize well but lack cross-slice spatial consistency, while volume-driven LVLMs explicitly capture volumetric semantics but suffer from coarse granularity and poor compatibility with slice inputs. The absence of a unified modeling paradigm is a major bottleneck for the clinical translation of medical LVLMs. We present OmniCT, a powerful unified slice-volume LVLM for CT scenarios, which makes three contributions: (i) Spatial Consistency Enhancement (SCE): volumetric slice composition combined with tri-axial positional embedding introduces volumetric consistency, and an MoE hybrid projection enables efficient slice-volume adaptation; (ii) Organ-level Semantic Enhancement (OSE): segmentation and ROI localization explicitly align anatomical regions, emphasizing lesion- and organ-level semantics; (iii) MedEval-CT: the largest slice-volume CT dataset and a hybrid benchmark that integrates comprehensive metrics for unified evaluation. OmniCT consistently outperforms existing methods by a substantial margin across diverse clinical tasks and satisfies both micro-level detail sensitivity and macro-level spatial reasoning. More importantly, it establishes a new paradigm for cross-modal medical imaging understanding.

Paper Structure

This paper contains 24 sections, 7 equations, 8 figures, 16 tables.

Figures (8)

  • Figure 1: (a) Statistics of the proposed MedEval-CT-Dataset. (b) Simplified architecture of the proposed OmniCT. (c) OmniCT consistently surpasses all baselines on both slice-driven and volume-driven CT benchmarks.
  • Figure 2: The architecture of OmniCT, a unified slice–volume LVLM paradigm.
  • Figure 3: (a) and (b) illustrate the data distribution of MedEval-CT-Bench at the slice and volume levels, respectively, encompassing both the clinical-based categorization (4 types: GIR, MAI, AII, and CRD) and the organ-level distribution (13 organs). (c) presents the data engineering pipeline.
  • Figure 4: (a) Comparison of OmniCT with 2D/3D LVLMs on 2D/3D benchmarks using 30% and 100% of the 2D, 3D, and mixed 2D/3D training data. (b) Study of using a 3D vision encoder and 2D vision encoders pre-trained in different ways. (c) Per-organ performance heatmap of 2D/3D models and OmniCT on the 2D/3D MedEval-CT-Bench. (d) Performance heatmaps by clinical task category, and bar charts comparing performance with clinical knowledge requirements across task categories.
  • Figure 5: Prompt template of the data orchestration engine for generating MedEval-CT.
  • ...and 3 more figures