MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives

Wisdom O. Ikezogwo, Kevin Zhang, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Linda Shapiro, Ranjay Krishna

TL;DR

MedicalNarratives addresses the scarcity of large-scale grounded multimodal data in medicine by curating 4.7M image-text pairs from YouTube pedagogy videos and PubMed, including 1M samples with localized traces for spatiotemporal grounding. The authors train GenMedCLIP, a CLIP-like vision-language model, across 12 medical domains and show consistent improvements over state-of-the-art baselines on a comprehensive downstream medical benchmark, as well as strong zero-shot and cross-modal retrieval performance. They demonstrate that interleaved video and article data, along with trace-based grounding, provide valuable supervision for both classification and retrieval, and they outline broad opportunities for spatially controlled generation, interactive segmentation, and grounded reporting. The work introduces a scalable data-curation pipeline, highlights limitations such as lack of expert bounding boxes and biases toward abnormal cases, and provides a foundation for future spatially aware medical VL models and applications.

Abstract

Multi-modal models are data hungry. While datasets of natural images are abundant, medical image datasets cannot afford the same luxury. To enable representation learning for medical images at scale, we turn to YouTube, a platform with a large reservoir of open-source medical pedagogical videos. We curate MedicalNarratives, a dataset of 4.7M medical image-text pairs, with 1M samples containing dense annotations in the form of spatial traces (and bounding boxes), and 118K videos centered on the trace event (with aligned text), enabling spatiotemporal grounding beyond single frames. Similar to think-aloud studies, where instructors speak while hovering their mouse cursor over relevant image regions, 1M images in MedicalNarratives contain localized mouse traces in image pixels, creating a spatial and temporal association between the text and pixels. To evaluate the utility of MedicalNarratives, we train GenMedCLIP with a CLIP-like objective using our dataset spanning 12 medical domains. GenMedCLIP outperforms previous state-of-the-art models on all 12 domains on a newly constructed medical imaging benchmark. Data: https://huggingface.co/datasets/wisdomik/MedicalNarratives
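The abstract states only that the model is trained with a "CLIP-like objective". As a rough illustration of what such an objective usually looks like, the following is a minimal pure-Python sketch of the standard symmetric contrastive (InfoNCE) loss over a batch of paired image and text embeddings. The function names, the temperature value of 0.07, and the toy embeddings are assumptions made for illustration; they are not taken from the paper and may differ from the authors' training setup.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row k of each input is assumed to be a matched pair.

    temperature=0.07 is an illustrative default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: each image should pick out its own caption.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: each caption should pick out its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch of two pairs: matched pairs should score a much lower loss.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In this sketch the loss approaches zero when every image is most similar to its own text, and equals log(n) when all embeddings in a batch of size n are identical, since the model then cannot tell the pairs apart.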

Paper Structure (43 sections, 25 figures, 11 tables)

Figures (25)

  • Figure 1: MedicalNarratives: Examples from our medical imaging modalities, excluding surgery, endoscopy, and general medical images due to their graphic nature. These samples are selected from interleaved video samples; each shows the image, the denoised text, and the time-aligned spatial traces and bounding boxes, across 4 domains. See the Appendix (section supp:examples) for more examples and raw input text.
  • Figure 2: Breakdown of MedicalNarratives in size by modalities across both video and article subsets.
  • Figure 3: The data curation pipeline for the Video subset of the MedicalNarratives dataset. Search: YouTube video-first search strategy, with filtering by pre-trained classifiers and heuristics. Image: Extracting keyframes of a video, denoising, and identifying medical images. Text: ASR transcription, text correction with LLMs, and medical/ROI text extraction. Traces: Identifying stable chunks of a video, then localizing cursor traces within each chunk. Alignment: Mapping medical/ROI text, traces, and images together. Samples are classified into finer-grained subdomains, and samples with discussions of multiple domains are identified with LLMs.
  • Figure 4: Zero-shot classification results show that our model, GenMedCLIP, outperforms all other baselines, including the out-of-domain CLIP and the biomedical vision-language models BiomedCLIP and PubMedCLIP, across the constructed medical benchmark, which covers all 11 medical domains represented. The metric for X-ray and mammography is mean average precision; the rest use accuracy.
  • Figure 5: Using traces as prompts for segmentation with ScribblePrompt-SAM. (Right) Resulting mask from the trace (Center).
  • ...and 20 more figures
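
The Figure 1 caption pairs each spatial trace with a bounding box. This page does not describe how the boxes are derived, so the sketch below shows only one simple possibility: taking the tight axis-aligned box around the cursor points of a trace, with optional padding, clipped to the image. The function name `trace_to_bbox`, the `pad` parameter, and the point format `(x, y)` in pixels are hypothetical choices for illustration, not the dataset's schema or the authors' method.

```python
def trace_to_bbox(trace, width, height, pad=0):
    """Return (x_min, y_min, x_max, y_max) enclosing a cursor trace.

    trace  : list of (x, y) pixel coordinates (assumed format, not the dataset schema)
    width  : image width in pixels
    height : image height in pixels
    pad    : extra margin in pixels added on every side (hypothetical parameter)
    """
    if not trace:
        raise ValueError("trace must contain at least one point")
    xs = [x for x, _ in trace]
    ys = [y for _, y in trace]
    # Expand by the padding, then clip so the box stays inside the image.
    x_min = max(0, min(xs) - pad)
    y_min = max(0, min(ys) - pad)
    x_max = min(width - 1, max(xs) + pad)
    y_max = min(height - 1, max(ys) + pad)
    return (x_min, y_min, x_max, y_max)


# Toy trace of three cursor positions on a 64x48 image.
example_trace = [(10, 20), (30, 5), (25, 40)]
box = trace_to_bbox(example_trace, width=64, height=48, pad=2)
print(box)  # (8, 3, 32, 42)
```

A box of this kind could serve as a coarse region prompt, while the raw trace points remain available as a finer-grained prompt, as in the segmentation use shown in Figure 5.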