How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?

Rahul Thapa; Andrew Li; Qingyang Wu; Bryan He; Yuki Sahashi; Christina Binder; Angela Zhang; Ben Athiwaratkun; Shuaiwen Leon Song; David Ouyang; James Zou

TL;DR

The paper investigates whether publicly available educational biomedical videos can train general vision-language models. It introduces OpenBiomedVid, a 1031-hour instructional video dataset with cleaning, frame grounding, and an instruction-tuning signal, along with two expert benchmarks for video understanding. Fine-tuning Qwen-2-VL models on this data yields substantial gains on video and image benchmarks, with more nuanced effects on text tasks, demonstrating the viability of video-centric supervision for biomedical VLMs. The work highlights both the potential and the challenges of scaling to long-form biomedical video understanding and provides a foundation for future multimodal, domain-specific AI in medicine.

Abstract

Publicly available biomedical videos, such as those on YouTube, serve as valuable educational resources for medical students. Unlike standard machine learning datasets, these videos are designed for human learners, often mixing medical imagery with narration, explanatory diagrams, and contextual framing. In this work, we investigate whether such pedagogically rich yet non-standardized and heterogeneous videos can effectively teach general-domain vision-language models biomedical knowledge. To this end, we introduce OpenBiomedVid, a biomedical video instruction tuning dataset comprising 1031 hours of video-caption and Q/A pairs, curated through a multi-step human-in-the-loop pipeline. Diverse biomedical video datasets are rare, and OpenBiomedVid fills an important gap by providing instruction-style supervision grounded in real-world educational content. Surprisingly, despite the informal and heterogeneous nature of these videos, the fine-tuned Qwen-2-VL models exhibit substantial performance improvements across most benchmarks. Compared to their respective base models, the 2B model achieves gains of 98.7% on video tasks, 71.2% on image tasks, and 0.2% on text tasks, while the 7B model shows improvements of 37.09% on video tasks and 11.2% on image tasks, with a slight degradation of 2.7% on text tasks. To address the lack of standardized biomedical video evaluation datasets, we also introduce two new expert-curated benchmarks, MIMICEchoQA and SurgeryVideoQA. On these benchmarks, the 2B model achieves gains of 99.1% and 98.1%, while the 7B model shows gains of 22.5% and 52.1%, respectively, demonstrating the models' ability to generalize and perform biomedical video understanding on cleaner and more standardized datasets than those seen during training. These results suggest that educational videos created for human learning offer a surprisingly effective training signal for biomedical VLMs.
