Med-SORA: Symptom to Organ Reasoning in Abdomen CT Images

You-Kyoung Na, Yeong-Jun Cho

TL;DR

Med-SORA tackles symptom-to-organ reasoning in abdominal CT imaging by constructing a RAG-based, organ-specific symptom-text dataset and learning soft symptom–organ associations via learnable organ anchors. It introduces a 2D-3D cross-attention fusion that combines slice-level detail with full 3D context, and aligns text and image embeddings using an InfoNCE objective. Empirical results on BTCV data show that soft labeling better captures multi-organ relationships and that the 2D-3D fusion yields superior organ identification and reasoning performance, outperforming multiple baselines. The approach yields interpretable 3D visualizations of symptom-related organ involvement, offering a clinically meaningful tool for diagnostic reasoning and education.

Abstract

Understanding symptom-image associations is crucial for clinical reasoning. However, existing medical multimodal models often rely on simple one-to-one hard labeling, oversimplifying clinical reality where symptoms relate to multiple organs. In addition, they mainly use single-slice 2D features without incorporating 3D information, limiting their ability to capture full anatomical context. In this study, we propose Med-SORA, a framework for symptom-to-organ reasoning in abdominal CT images. Med-SORA introduces RAG-based dataset construction, soft labeling with learnable organ anchors to capture one-to-many symptom-organ relationships, and a 2D-3D cross-attention architecture to fuse local and global image features. To our knowledge, this is the first work to address symptom-to-organ reasoning in medical multimodal learning. Experimental results show that Med-SORA outperforms existing medical multimodal models and enables accurate 3D clinical reasoning.
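The text-image alignment step mentioned in the TL;DR uses an InfoNCE objective. As a rough illustration only, the following NumPy sketch computes a one-directional (text-to-image) InfoNCE loss over a batch of paired embeddings. The function names, the temperature value of 0.07, and the use of cosine similarity are assumptions made for this example; the summary above does not specify them, and the paper's actual formulation may differ.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(text_emb, image_emb, temperature=0.07):
    """Text-to-image InfoNCE loss for N paired embeddings of shape (N, D).

    Row i of text_emb is assumed to match row i of image_emb; all other
    rows in the batch act as negatives. The temperature is a hypothetical
    default, not a value taken from the paper.
    """
    t = l2_normalize(text_emb)
    v = l2_normalize(image_emb)
    logits = (t @ v.T) / temperature                      # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # matched pairs sit on the diagonal


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(8, 16))
    image = text + 0.01 * rng.normal(size=(8, 16))        # nearly aligned pairs
    print("aligned loss:   ", info_nce(text, image))
    print("misaligned loss:", info_nce(text, np.roll(image, 1, axis=0)))
```

When the pairs are well aligned the loss approaches zero, and it grows when the pairing is broken, which is the behavior the alignment stage relies on.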

Paper Structure

This paper contains 14 sections, 13 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Comparison between existing medical vision-language models and the proposed Med-SORA. Ground-truth organs are marked with $\star$.
  • Figure 2: The pipeline of Med-SORA, showing the optimization process for $o_1=\text{liver}$. It consists of (a) RAG-based dataset construction, (b) soft labeling-based text embedding learning, and (c) 3D-2D feature-based image embedding learning with text-image embedding alignment.
  • Figure 3: Soft labeling for symptom text embeddings via learnable anchors. Different colors represent organ classes. $\leftrightarrow$ means similarity computation, while pull/push operations show the optimization process for $s^+$ and $s^-$ with margin $m$.
  • Figure 4: The proposed 2D-3D feature fusion architecture.
  • Figure 5: t-SNE visualization of symptom text embeddings learned with soft labeling. Different colors indicate different organ classes, and star symbols denote the learned positive anchors for each organ. (Best viewed in color)
  • ...and 3 more figures
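The similarity computation between a symptom embedding and the organ anchors described in the Figure 3 caption can be illustrated with a small sketch. This is only one plausible reading, under stated assumptions: it takes cosine similarity to each anchor and applies a temperature-scaled softmax to obtain a soft distribution over organs. The function name `soft_organ_labels`, the temperature of 0.1, and the softmax itself are hypothetical choices for this example; the margin-based pull/push optimization for $s^+$ and $s^-$ is not shown.

```python
import numpy as np


def soft_organ_labels(symptom_emb, anchors, temperature=0.1):
    """Turn symptom-to-anchor similarities into a soft label distribution.

    symptom_emb: (N, D) symptom text embeddings.
    anchors:     (K, D) organ anchors, one per organ class.
    Returns an (N, K) array whose rows sum to 1. The softmax and the
    temperature are assumptions of this sketch, not the paper's method.
    """
    s = symptom_emb / np.linalg.norm(symptom_emb, axis=-1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=-1, keepdims=True)
    sim = (s @ a.T) / temperature                 # cosine similarity, scaled
    sim = sim - sim.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(sim)
    return weights / weights.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    anchors = np.eye(3, 4)                        # three toy organ anchors in 4-D
    symptom = np.array([[0.9, 0.4, 0.0, 0.0]])    # close to organ 0, partly organ 1
    print(soft_organ_labels(symptom, anchors))
```

In this toy case the symptom receives most of its weight on the first organ while keeping a non-zero share on the second, which is the one-to-many behavior that soft labeling is meant to capture.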