Med-SORA: Symptom to Organ Reasoning in Abdomen CT Images
You-Kyoung Na, Yeong-Jun Cho
TL;DR
Med-SORA tackles symptom-to-organ reasoning in abdominal CT imaging by constructing a RAG-based, organ-specific symptom-text dataset and learning soft symptom–organ associations through learnable organ anchors. It introduces a 2D-3D cross-attention fusion that combines slice-level detail with full 3D context, and it aligns text and image embeddings with an InfoNCE objective. Experiments on BTCV data show that soft labeling captures multi-organ relationships better than hard labeling and that the 2D-3D fusion improves organ identification and reasoning, outperforming multiple baselines. The approach also produces interpretable 3D visualizations of symptom-related organ involvement, offering a clinically meaningful tool for diagnostic reasoning and education.
Abstract
Understanding symptom-image associations is crucial for clinical reasoning. However, existing medical multimodal models often rely on simple one-to-one hard labeling, which oversimplifies clinical reality, where a single symptom can relate to multiple organs. In addition, they mainly use single-slice 2D features without incorporating 3D information, which limits their ability to capture full anatomical context. In this study, we propose Med-SORA, a framework for symptom-to-organ reasoning in abdominal CT images. Med-SORA introduces RAG-based dataset construction, soft labeling with learnable organ anchors to capture one-to-many symptom-organ relationships, and a 2D-3D cross-attention architecture to fuse local and global image features. To our knowledge, this is the first work to address symptom-to-organ reasoning in medical multimodal learning. Experimental results show that Med-SORA outperforms existing medical multimodal models and enables accurate 3D clinical reasoning.
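To make the two central ideas concrete, the following is a minimal, dependency-free sketch of (a) turning similarities between a symptom embedding and per-organ anchors into a soft label distribution, and (b) an InfoNCE loss over paired text and image embeddings. It is an illustration under stated assumptions, not the Med-SORA implementation: the function names, the use of cosine similarity, the temperature values, the toy 2-D embeddings, and the organ names are all hypothetical, and the anchors here are fixed vectors rather than learned parameters.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_organ_labels(symptom_emb, organ_anchors, temperature=0.1):
    """Soft label: a probability distribution over organs.

    Assumption: similarity to each organ anchor is cosine similarity,
    sharpened by a temperature. In a trained model the anchors would be
    learnable parameters; here they are fixed toy vectors.
    """
    sims = [cosine(symptom_emb, anchor) for anchor in organ_anchors]
    return softmax(sims, temperature)


def info_nce(text_embs, image_embs, temperature=0.07):
    """Text-to-image InfoNCE: the i-th image is the positive for the i-th text,
    and every other image in the batch acts as a negative."""
    losses = []
    for i, text in enumerate(text_embs):
        probs = softmax([cosine(text, img) for img in image_embs], temperature)
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)


# Hypothetical organs and toy 2-D anchors, for illustration only.
organs = ["liver", "pancreas", "gallbladder"]
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
symptom = [1.0, 0.8]  # stand-in for an encoded symptom description

labels = soft_organ_labels(symptom, anchors)
for organ, p in zip(organs, labels):
    print(f"{organ}: {p:.3f}")

texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
shuffled_images = [[0.0, 1.0], [1.0, 0.0]]
print("aligned loss: ", info_nce(texts, aligned_images))
print("shuffled loss:", info_nce(texts, shuffled_images))
```

Unlike a one-hot hard label, the soft label assigns non-zero probability to every organ, which is how a single symptom can be associated with several organs at once. The InfoNCE loss is small when each text embedding is closest to its own image embedding and large when the pairing is broken.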
