Bora: Biomedical Generalist Video Generation Model

Weixiang Sun, Xiaocao You, Ruizhe Zheng, Zhengqing Yuan, Xiang Li, Lifang He, Quanzheng Li, Lichao Sun

TL;DR

Bora addresses a critical gap in biomedical video generation by introducing a spatio-temporal diffusion model that follows complex medical instructions. It leverages a Transformer-based diffusion backbone initialized from a general video model and refines capability through two-stage domain adaptation on a new biomedical text-video corpus, aided by LLM-generated captions and background knowledge. The approach yields high-quality videos across endoscopy, ultrasound, real-time MRI, and cellular motility, outperforming several baselines in realism and biomedical understanding while maintaining modality-wide consistency. The work also delivers the first comprehensive annotated biomedical video dataset and discusses practical implications for medical education, consultation, and AR/VR training, while acknowledging data availability and duration limitations.

Abstract

Generative models hold promise for revolutionizing medical education, robot-assisted surgery, and data augmentation for medical AI development. Diffusion models can now generate realistic images from text prompts, while recent advancements have demonstrated their ability to create diverse, high-quality videos. However, these models often struggle with generating accurate representations of medical procedures and detailed anatomical structures. This paper introduces Bora, the first spatio-temporal diffusion probabilistic model designed for text-guided biomedical video generation. Bora leverages a Transformer architecture and is pre-trained on general-purpose video generation tasks. It is fine-tuned through model alignment and instruction tuning using a newly established medical video corpus, which includes paired text-video data from various biomedical fields. To the best of our knowledge, this is the first attempt to establish such a comprehensive annotated biomedical video dataset. Bora is capable of generating high-quality video data across four distinct biomedical domains, adhering to medical expert standards and demonstrating consistency and diversity. This generalist video generative model holds significant potential for enhancing medical consultation and decision-making, particularly in resource-limited settings. Additionally, Bora could pave the way for immersive medical training and procedure planning. Extensive experiments on distinct medical modalities such as endoscopy, ultrasound, MRI, and cell tracking validate the effectiveness of our model in understanding biomedical instructions and its superior performance across subjects compared to state-of-the-art generation models.

Paper Structure (31 sections, 4 equations, 15 figures, 3 tables)

Figures (15)

  • Figure 1: The overall process for generating captions. First, the agent extracts background information from the corresponding dataset, which is then injected into the LLM. Then, combined with the frame sequences, it generates high-quality captions.
  • Figure 2: Sample videos produced by Bora, shown with their corresponding text prompts, covering four biomedical modalities: endoscopy, ultrasound, real-time MRI, and cellular visualization.
  • Figure 3: The overall architecture and training details of Bora.
  • Figure 4: Comparison of videos generated from the same prompt in the endoscopy modality. From top to bottom: Bora, Pika, PixVerse, Gen-2, ModelScope, and LaVie.
  • Figure 5: The distribution of video length (on the y-axis) and caption length (on the x-axis) in our text-video pair dataset, along with its fitted curve.
  • ...and 10 more figures