Adapting Video Diffusion Models for Time-Lapse Microscopy

Alexander Holmberg, Nils Mechtel, Wei Ouyang

TL;DR

This work addresses the generation of biologically plausible time-lapse microscopy videos by fine-tuning a pretrained video diffusion model (CogVideoX) on HeLa cell data. It investigates three conditioning strategies (text prompts derived from phenotype scores, numeric phenotype embeddings, and image-conditioned generation) alongside an unconditional baseline, with the aim of steering temporal dynamics such as mitosis and migration. The results show substantial realism gains over zero-shot baselines and demonstrate extrapolation to longer sequences, though explicit phenotype control remains challenging. The approach enables in-silico hypothesis testing and data augmentation for downstream tasks, such as testing cell-tracking robustness under varied perturbations.

Abstract

We present a domain adaptation of video diffusion models to generate highly realistic time-lapse microscopy videos of cell division in HeLa cells. Although state-of-the-art generative video models have advanced significantly for natural videos, they remain underexplored in microscopy domains. To address this gap, we fine-tune a pretrained video diffusion model on microscopy-specific sequences, exploring three conditioning strategies: (1) text prompts derived from numeric phenotypic measurements (e.g., proliferation rates, migration speeds, cell-death frequencies), (2) direct numeric embeddings of phenotype scores, and (3) image-conditioned generation, where an initial microscopy frame is extended into a complete video sequence. Evaluation using biologically meaningful morphological, proliferation, and migration metrics demonstrates that fine-tuning substantially improves realism and accurately captures critical cellular behaviors such as mitosis and migration. Notably, the fine-tuned model also generalizes beyond the training horizon, generating coherent cell dynamics even in extended sequences. However, precisely controlling specific phenotypic characteristics remains challenging, highlighting opportunities for future work to enhance conditioning methods. Our results demonstrate the potential for domain-specific fine-tuning of generative video models to produce biologically plausible synthetic microscopy data, supporting applications such as in-silico hypothesis testing and data augmentation.

Paper Structure

This paper contains 30 sections, 4 equations, 5 figures, 5 tables, 3 algorithms.

Figures (5)

  • Figure 1: Example nucleus segmentation on a single frame. We apply this segmentation pipeline to every frame in each time-lapse, extracting per-nucleus shape descriptors (e.g., area, eccentricity, solidity, perimeter). These values are then aggregated across frames to compute the population-level morphological metrics used in our evaluation.
  • Figure 2: Overview of evaluation pipeline. Metrics computed from real and generated videos are compared using Wasserstein distance to assess realism and phenotype alignment.
  • Figure 3: Visual comparison of zero-shot CogVideoX (top row), fine-tuned CogVideoX (middle row), and real microscopy frames (bottom row) at four time points (0, 20, 60, 80). The zero-shot baseline produces unrealistic backgrounds and fails to capture cell divisions, whereas the fine-tuned model generates biologically plausible mitotic events and cell population growth patterns that closely resemble real data.
  • Figure 4: Comparison of average cell counts between real videos and our unconditional model, extended to 129 frames. The shaded region indicates the standard deviation across multiple sequences.
  • Figure 5: Visual comparison of the zero-shot baseline, the LoRA fine-tuned model, the fully fine-tuned model, and real microscopy sequences. Both fine-tuned models generate biologically plausible cell divisions similar to those in real data, whereas the zero-shot baseline does not.
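
The Figure 2 caption describes comparing metrics from real and generated videos with the Wasserstein distance. The following is a minimal sketch of that comparison for a single scalar metric, assuming one value per video and uniform weights. The function name `wasserstein_1d` and the sample values are hypothetical and are not taken from the paper; the authors' actual implementation may differ (for example, it may rely on a library routine).

```python
from bisect import bisect_right


def wasserstein_1d(real_values, generated_values):
    """Wasserstein-1 distance between two empirical 1-D distributions.

    Computed as the integral of |CDF_real(x) - CDF_generated(x)| over x,
    with every sample given equal weight.
    """
    if not real_values or not generated_values:
        raise ValueError("both samples must be non-empty")

    real_sorted = sorted(real_values)
    gen_sorted = sorted(generated_values)

    # Both empirical CDFs are piecewise constant and only change at sample points.
    grid = sorted(set(real_sorted) | set(gen_sorted))

    distance = 0.0
    for left, right in zip(grid, grid[1:]):
        # Fraction of samples less than or equal to the left edge of the interval.
        cdf_real = bisect_right(real_sorted, left) / len(real_sorted)
        cdf_gen = bisect_right(gen_sorted, left) / len(gen_sorted)
        distance += abs(cdf_real - cdf_gen) * (right - left)
    return distance


if __name__ == "__main__":
    # Hypothetical per-video mean nucleus areas (arbitrary units), for illustration only.
    real_area = [210.0, 225.0, 198.0, 240.0]
    generated_area = [205.0, 231.0, 190.0, 252.0]
    print(f"W1(real, generated) = {wasserstein_1d(real_area, generated_area):.3f}")
```

A smaller distance means the generated distribution of that metric lies closer to the real one. Repeating the computation for each morphological, proliferation, and migration metric would give one score per metric, which is one plausible way to read the pipeline in Figure 2.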