How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?

Rahul Thapa, Andrew Li, Qingyang Wu, Bryan He, Yuki Sahashi, Christina Binder, Angela Zhang, Ben Athiwaratkun, Shuaiwen Leon Song, David Ouyang, James Zou

TL;DR

The paper investigates whether publicly available educational biomedical videos can teach biomedical knowledge to general vision-language models (VLMs). It introduces OpenBiomedVid, a 1031-hour instruction-tuning dataset built from such videos through biomedical frame segmentation, caption cleaning, and Q/A generation, along with two expert-curated video benchmarks, MIMICEchoQA and SurgeryVideoQA. Fine-tuning Qwen-2-VL models on this data yields substantial gains on video and image benchmarks, with mixed effects on text tasks (a 0.2% gain for the 2B model and a 2.7% degradation for the 7B model), demonstrating the viability of video-centric supervision for biomedical VLMs. The work highlights both the potential and the challenges of scaling to long-form biomedical video understanding and provides a foundation for future multimodal, domain-specific AI in medicine.

Abstract

Publicly available biomedical videos, such as those on YouTube, serve as valuable educational resources for medical students. Unlike standard machine learning datasets, these videos are designed for human learners, often mixing medical imagery with narration, explanatory diagrams, and contextual framing. In this work, we investigate whether such pedagogically rich, yet non-standardized and heterogeneous, videos can effectively teach general-domain vision-language models biomedical knowledge. To this end, we introduce OpenBiomedVid, a biomedical video instruction-tuning dataset comprising 1031 hours of video-caption and Q/A pairs, curated through a multi-step human-in-the-loop pipeline. Diverse biomedical video datasets are rare, and OpenBiomedVid fills an important gap by providing instruction-style supervision grounded in real-world educational content. Surprisingly, despite the informal and heterogeneous nature of these videos, the fine-tuned Qwen-2-VL models exhibit substantial performance improvements across most benchmarks. Compared to its base model, the 2B model achieves gains of 98.7% on video tasks, 71.2% on image tasks, and 0.2% on text tasks. The 7B model shows improvements of 37.09% on video and 11.2% on image tasks, with a slight degradation of 2.7% on text tasks, compared to its base model. To address the lack of standardized biomedical video evaluation datasets, we also introduce two new expert-curated benchmarks, MIMICEchoQA and SurgeryVideoQA. On these benchmarks, the 2B model achieves gains of 99.1% and 98.1%, while the 7B model shows gains of 22.5% and 52.1%, respectively, demonstrating the models' ability to generalize and perform biomedical video understanding on cleaner and more standardized datasets than those seen during training. These results suggest that educational videos created for human learning offer a surprisingly effective training signal for biomedical VLMs.
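
The percentage gains above are easiest to read as relative improvements over the corresponding base model, although the abstract does not state the formula explicitly. A minimal sketch under that assumption follows; the function name and the scores are hypothetical and are not taken from the paper.

```python
def relative_gain(base: float, finetuned: float) -> float:
    """Percent change of the fine-tuned score relative to the base score.

    Assumption: gains are computed as (finetuned - base) / base * 100.
    """
    if base == 0:
        raise ValueError("base score must be non-zero")
    return (finetuned - base) / base * 100.0


# Hypothetical benchmark accuracies, for illustration only.
base_acc, tuned_acc = 20.0, 30.0
print(f"{relative_gain(base_acc, tuned_acc):.1f}%")  # prints 50.0%
```

Under this reading, a gain near 100% means the fine-tuned score is roughly double the base score, and a negative value corresponds to a degradation such as the one reported for the 7B model on text tasks.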

Paper Structure

This paper contains 41 sections, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Data generation pipeline. (1) Dataset Curation: Videos are collected from YouTube using manually and clinically guided search queries. (2) Biomedical Frame Segmentation: A SigLIP model is fine-tuned using GPT-labeled data to detect biomedical frames. (3) Caption Refinement: Transcriptions generated using Whisper are filtered and cleaned using GPT-4o to ensure visual grounding. (4) Instruction Data Generation: GPT-4o is used to generate multi-turn Q/A pairs and extract structured metadata from the cleaned captions. Human verification is incorporated throughout the pipeline to ensure high data quality.
  • Figure 2: Distribution of the fine-tuning dataset across different biomedical modalities and anatomical regions.
  • Figure 3: Comparison between the fine-tuning dataset and the evaluation dataset, highlighting the distribution shift. The evaluation dataset is significantly cleaner and more structured than the fine-tuning dataset, enabling a more accurate assessment of model performance.
  • Figure 4: Comparison of fine-tuned models and baseline models across video, image, and text benchmarks.
  • Figure S1: Statistics of the SurgeryVideoQA evaluation dataset, showing the distribution of videos and Q/A pairs across different anatomical regions and surgical procedures.
  • ...and 7 more figures
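
Steps (2) and (3) of the Figure 1 pipeline can be sketched roughly as follows: per-frame biomedical scores are merged into time segments, and only transcript captions overlapping those segments are kept. This is a simplified illustration, not the authors' implementation. The fine-tuned SigLIP classifier and the GPT-4o cleaning step are not called here; the scores are supplied directly, and the names `Caption`, `biomedical_segments`, `captions_in_segments`, the 0.5 threshold, and the one-frame-per-second sampling rate are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    start: float  # seconds
    end: float    # seconds
    text: str


def biomedical_segments(frame_probs, fps=1.0, threshold=0.5):
    """Merge consecutive frames scored >= threshold into (start, end) spans in seconds."""
    segments, start = [], None
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i  # a biomedical run begins at this frame
        elif p < threshold and start is not None:
            segments.append((start / fps, i / fps))  # run ended before this frame
            start = None
    if start is not None:  # close a run that lasts until the final frame
        segments.append((start / fps, len(frame_probs) / fps))
    return segments


def captions_in_segments(captions, segments):
    """Keep captions whose time span overlaps at least one biomedical segment."""
    return [
        c for c in captions
        if any(c.start < seg_end and c.end > seg_start for seg_start, seg_end in segments)
    ]


# Hypothetical classifier outputs, one score per sampled frame.
probs = [0.1, 0.9, 0.8, 0.2, 0.7]
captions = [
    Caption(0.0, 1.0, "channel intro"),
    Caption(1.5, 2.5, "apical four-chamber view"),
    Caption(3.0, 4.0, "speaker on camera"),
]
segments = biomedical_segments(probs)
kept = captions_in_segments(captions, segments)
print(segments)                # [(1.0, 3.0), (4.0, 5.0)]
print([c.text for c in kept])  # ['apical four-chamber view']
```

In the pipeline described in Figure 1, the retained captions would then be cleaned for visual grounding and used to generate Q/A pairs, with human verification along the way.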