BIMCV-R: A Landmark Dataset for 3D CT Text-Image Retrieval

Yinda Chen, Che Liu, Xiaoyu Liu, Rossella Arcucci, Zhiwei Xiong

TL;DR

The paper addresses the lack of robust benchmarks for 3D medical text-image retrieval by introducing BIMCV-R, a public dataset of 8,069 3D CT volumes (over 2 million slices in total) paired with radiology reports and expert annotations. It also presents MedFinder, a dual-stream retrieval framework that combines BiomedCLIP-based language representations, text sampling, view consistency, and cross-attention fusion to align 3D CT imagery with clinical narratives and to support keyword-based search; the model is optimized with a joint objective $L_{total} = L_{mse} + \alpha L_{sim}$. The authors report better performance than baseline methods on multimodal retrieval and show practical utility for keyword-based retrieval, highlighting the potential of large language models to improve 3D medical image retrieval. The work positions BIMCV-R as a foundational benchmark for scalable, clinician-friendly, text-guided retrieval of 3D medical imaging data, with direct relevance to diagnostic support and case-based reference.
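
To make the shape of the joint objective $L_{total} = L_{mse} + \alpha L_{sim}$ concrete, here is a minimal sketch in plain Python. This summary does not define the two terms, so the sketch assumes that $L_{mse}$ is a mean squared error between features of two views of the same volume and that $L_{sim}$ is one minus the cosine similarity between an image embedding and a text embedding. The function names, the arguments, and the default `alpha` are hypothetical and are not taken from the paper.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def total_loss(view_a, view_b, image_emb, text_emb, alpha=1.0):
    """Weighted sum L_total = L_mse + alpha * L_sim.

    Assumption (not from the source): L_mse compares two views of one
    volume, and L_sim = 1 - cos(image_emb, text_emb).
    """
    l_mse = mse(view_a, view_b)
    l_sim = 1.0 - cosine_similarity(image_emb, text_emb)
    return l_mse + alpha * l_sim
```

Under these assumptions, $\alpha$ only sets how strongly the cross-modal term is weighted against the consistency term; with identical views and perfectly aligned embeddings both terms vanish.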

Abstract

The burgeoning integration of 3D medical imaging into healthcare has led to a substantial increase in the workload of medical professionals. To assist clinicians in their diagnostic processes and alleviate their workload, the development of a robust system for retrieving similar case studies presents a viable solution. While the concept holds great promise, the field of 3D medical text-image retrieval is currently limited by the absence of robust evaluation benchmarks and curated datasets. To remedy this, our study presents a groundbreaking dataset, BIMCV-R, which includes an extensive collection of 8,069 3D CT volumes, encompassing over 2 million slices, paired with their respective radiological reports. Expanding upon the foundational work of our dataset, we craft a retrieval strategy, MedFinder. This approach employs a dual-stream network architecture, harnessing the potential of large language models to advance the field of medical image retrieval beyond existing text-image retrieval solutions. It marks our preliminary step towards developing a system capable of facilitating text-to-image, image-to-text, and keyword-based retrieval tasks. Our project is available at https://huggingface.co/datasets/cyd0806/BIMCV-R.
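
The text-to-image and image-to-text tasks named in the abstract can be illustrated with a generic embedding-based retrieval sketch. It is not MedFinder itself: it assumes that some pair of encoders has already mapped each report and each CT volume to a vector, that the report at index i is paired with the volume at index i, and that ranking uses cosine similarity. The function names and the toy vectors are hypothetical.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_by_cosine(query, gallery):
    """Return gallery indices ordered from most to least similar to the query."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(g))) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose paired item (same index) appears in the top k."""
    hits = sum(
        1 for i, q in enumerate(queries)
        if i in rank_by_cosine(q, gallery)[:k]
    )
    return hits / len(queries)

# Toy stand-ins for encoder outputs (hypothetical values).
text_embeddings = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.2]]
image_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Text-to-image retrieval: reports are the queries, volumes are the gallery.
print(recall_at_k(text_embeddings, image_embeddings, k=1))
# Image-to-text retrieval: swap the two roles.
print(recall_at_k(image_embeddings, text_embeddings, k=1))
```

Swapping the query and gallery arguments is all that separates the two retrieval directions in this sketch, which is why both are commonly reported with the same recall-at-k metric.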

Paper Structure (14 sections, 7 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Construction of the BIMCV-R dataset. Utilizing the BIMCV dataset, we enhanced image quality through selective filtering, advanced denoising, and size standardization. For textual data, we translated radiological reports into English and refined them with GPT-4, ensuring consistency. Expert reviews and diagnoses further ensured data reliability and accuracy.
  • Figure 1: Summary of Image and Report Statistics.
  • Figure 2: Sample data of BIMCV-R.
  • Figure 3: Left: Word Frequency Analysis. Right: Word Cloud Analysis.
  • Figure 4: An overview of our method, divided into textual feature extraction, visual feature extraction, and similarity matching.