Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?

Yuxiang Lai, Jike Zhong, Ming Li, Yuheng Li, Xiaofeng Yang

TL;DR

The paper shows that a large vision model, trained on natural images and videos, can perform multiple medical imaging tasks on CT sequences in a zero-shot setting, including segmentation, denoising, super-resolution, and radiotherapy motion prediction. By tokenizing CT slices with VQGAN and modeling the sequence of phases with a decoder-only Transformer, the approach captures temporal dynamics and anatomical priors without task-specific fine-tuning. It achieves competitive segmentation and superior motion-prediction performance, often surpassing DVF-based and other baselines, and shows emergent reasoning capabilities on medical tasks it was never trained for. These findings suggest that video foundation models can serve as unified backbones for medical imaging, enabling scalable, temporal, and cross-task AI systems for clinical workflows, with potential extensions to longitudinal and multi-modality data.

Abstract

Recent advances in large generative models have shown that simple autoregressive formulations, when scaled appropriately, can exhibit strong zero-shot generalization across domains. Motivated by this trend, we investigate whether autoregressive video modeling principles can be directly applied to medical imaging tasks, despite the model never being trained on medical data. Specifically, we evaluate a large vision model (LVM) in a zero-shot setting across four representative tasks: organ segmentation, denoising, super-resolution, and motion prediction. Remarkably, even without domain-specific fine-tuning, the LVM can delineate anatomical structures in CT scans and achieve competitive performance on segmentation, denoising, and super-resolution. Most notably, in radiotherapy motion prediction, the model forecasts future 3D CT phases directly from prior phases of a 4D CT scan, producing anatomically consistent predictions that capture patient-specific respiratory dynamics with realistic temporal coherence. We evaluate the LVM on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes. Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy. These findings reveal the emergence of zero-shot capabilities in medical video modeling and highlight the potential of general-purpose video models to serve as unified learners and reasoners, laying the groundwork for future medical foundation models built on video models.
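
The forecasting setup described above (prior phases in, future phases out, one token at a time) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than the paper's code: `rollout_phases`, `predict_next_token`, and `tokens_per_phase` are illustrative, each phase is assumed to be already encoded as a fixed-length list of discrete token ids, and the toy predictor simply copies the previous phase so the example runs without a trained model.

```python
from typing import Callable, List, Sequence


def rollout_phases(
    context_phases: Sequence[Sequence[int]],
    predict_next_token: Callable[[List[int]], int],
    tokens_per_phase: int,
    num_future_phases: int,
) -> List[List[int]]:
    """Autoregressively generate token ids for future phases.

    Assumption: every phase is a fixed-length list of discrete token ids,
    and predict_next_token maps the running sequence to the next token id.
    """
    # Flatten the observed phases into one running token sequence.
    sequence = [tok for phase in context_phases for tok in phase]
    future: List[List[int]] = []
    for _ in range(num_future_phases):
        phase_tokens: List[int] = []
        for _ in range(tokens_per_phase):
            nxt = predict_next_token(sequence)
            sequence.append(nxt)  # each prediction becomes context for the next
            phase_tokens.append(nxt)
        future.append(phase_tokens)
    return future


def copy_previous_phase(sequence: List[int], tokens_per_phase: int = 4) -> int:
    """Toy stand-in for a trained model: repeat the token one phase back."""
    return sequence[-tokens_per_phase]


# Two observed "phases" of four tokens each; predict two further phases.
context = [[1, 2, 3, 4], [5, 6, 7, 8]]
predicted = rollout_phases(
    context, copy_previous_phase, tokens_per_phase=4, num_future_phases=2
)
print(predicted)
```

In a real pipeline the toy predictor would be replaced by sampling from the Transformer's next-token distribution, and the generated token ids would be decoded back to images by the tokenizer's decoder; neither step is shown here.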

Paper Structure

This paper contains 15 sections, 2 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Zero-shot learning and reasoning examples of the video model in medical imaging. From low-level perceptual restoration (super-resolution, denoising) to high-level understanding tasks (segmentation, motion modeling, and prediction), the video model can perform a range of medical imaging tasks directly from CT sequences without task-specific training. The examples highlight the potential to further advance video models toward becoming foundational vision models for medical imaging.
  • Figure 2: Schematic illustration of intrafractional tumor motion caused by respiratory cycles during thoracic and upper-abdominal radiotherapy. The top panel depicts the periodic expansion and contraction of the lungs, which drives not only pulmonary tumors but also displaces nearby organs such as the liver and heart, resulting in complex, predominantly vertical motion trajectories (blue dashed curve). The static radiation field (red dashed rectangle) is conventionally planned to encompass the entire tumor motion track to avoid geographic miss, but this inevitably irradiates more surrounding healthy tissue. The green checkmarks (✓) and red crosses (✗) mark tumor positions that are either fully covered or missed by the treatment field at different respiratory phases. This figure highlights a key clinical challenge: without accurate motion modeling, margins must be expanded to ensure target coverage for lung, liver, and cardiac-adjacent tumors, which increases unnecessary dose to adjacent organs-at-risk. Precise prediction of tumor trajectories can enable motion-adaptive strategies (e.g., gating, tracking) that safely reduce margins and support patient-specific motion management across multiple thoracic and abdominal sites.
  • Figure 3: Multi-phase motion prediction on the public dataset. We evaluate model performance on the public 4D CT dataset using Dice Similarity Coefficient (DSC, %). Each model is provided with the first five phases of the 4D CT scan and autoregressively predicts the next five phases. The plots show phase-by-phase DSC for five representative methods (DAM, DiffuseRT, ConvLSTM, RMSim, and our proposed LVM). LVM consistently achieves the highest DSC across all predicted phases and exhibits the smallest performance drop from phase #1 to #5, indicating its superior ability to model smooth and realistic multi-phase motion patterns.
  • Figure 4: Multi-phase motion prediction on the private dataset. The same DSC-based evaluation is conducted on our institutional 4D CT dataset (including lung, heart, and liver cases). Each model receives the first five phases and must generate the subsequent five phases. LVM maintains consistently higher DSC across all organs and phases, with smoother phase-to-phase transitions and less degradation compared to competing methods, demonstrating strong robustness and generalization to in-house data.
  • Figure 5: Qualitative visualization of lung motion. The first five phases are used as input, and the model predicts the next five. Each heatmap shows voxel-wise intensity differences between the ground truth (GT) and either the previous phase or the model prediction. Red indicates larger discrepancies. LVM accurately captures respiratory-induced motion, showing smaller errors than the previous-phase baseline and smoother temporal transitions, demonstrating coherent and anatomically consistent lung motion prediction.
  • ...and 2 more figures
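
The two quantities the captions rely on, the Dice Similarity Coefficient reported as a percentage (Figures 3 and 4) and the voxel-wise difference shown in the heatmaps (Figure 5), can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' evaluation code: the function names `dice_percent` and `abs_difference_map`, the convention of returning 100% when both masks are empty, and the toy 4x4 arrays are all assumptions made for the example.

```python
import numpy as np


def dice_percent(pred_mask, gt_mask) -> float:
    """Dice Similarity Coefficient in percent: 100 * 2|A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 100.0
    intersection = np.logical_and(pred, gt).sum()
    return float(100.0 * 2.0 * intersection / denom)


def abs_difference_map(volume_a, volume_b) -> np.ndarray:
    """Voxel-wise absolute intensity difference between two same-shape arrays."""
    a = np.asarray(volume_a, dtype=np.float32)
    b = np.asarray(volume_b, dtype=np.float32)
    return np.abs(a - b)


# Toy 2D example: the ground truth covers 4 voxels, the prediction covers 6,
# and all 4 ground-truth voxels fall inside the prediction.
gt = np.zeros((4, 4), dtype=np.uint8)
gt[1:3, 1:3] = 1
pred = np.zeros((4, 4), dtype=np.uint8)
pred[1:3, 1:4] = 1

dsc = dice_percent(pred, gt)          # 2 * 4 / (6 + 4) = 0.8, i.e. 80%
diff = abs_difference_map(pred, gt)   # non-zero only where the masks disagree
print(f"DSC = {dsc:.1f}%")
print("voxels that differ:", int(diff.sum()))
```

For a phase-by-phase curve like those in Figures 3 and 4, `dice_percent` would be applied once per predicted phase against that phase's ground-truth mask. For a Figure 5 style comparison, `abs_difference_map` would be computed once between the ground truth and the previous phase and once between the ground truth and the prediction.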