SynCellFactory: Generative Data Augmentation for Cell Tracking

Moritz Sturm, Lorenzo Cerrone, Fred A. Hamprecht

TL;DR

SynCellFactory tackles data scarcity in cell tracking by decoupling appearance from dynamics and using diffusion-based rendering to generate photorealistic, annotated cell videos. It employs a motion model to simulate cell populations and two specialized ControlNets (CN-Pos for accurate positioning and CN-Mov for temporal evolution) to produce sequences with pseudo ground-truth segmentation. Automated training minimizes the domain knowledge required, enabling scalable augmentation from a single annotated time-lapse; experiments on seven 2D CTC datasets show TRA improvements in most cases, and three official CTC results surpass prior methods. The approach demonstrates the practical potential of generative AI for boosting deep-learning-based cell tracking; its limitations in complex scenes and long sequences point to directions for future work.
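The reverse-time generation order described in the Figure 2 caption can be sketched as a simple loop. This is a minimal illustration, not the authors' code: `generate_sequence`, `render_pos` and `render_mov` are hypothetical names, the two callables stand in for CN-Pos and CN-Mov, and the trajectories are assumed to be precomputed by the motion model as one `{cell_id: (x, y)}` dictionary per time step. In this sketch, each rendered frame is stored together with the positions it was conditioned on, which is what would be kept as pseudo ground truth.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Position = Tuple[float, float]
Positions = Dict[int, Position]  # cell_id -> (x, y) for one time step


@dataclass
class Frame:
    t: int
    image: object  # rendered frame (placeholder type in this sketch)
    positions: Positions  # motion-model positions, kept as pseudo ground truth


def generate_sequence(
    trajectories: List[Positions],
    render_pos: Callable[[Positions], object],
    render_mov: Callable[[object, Positions, Positions], object],
) -> List[Frame]:
    """Render frames T, T-1, ..., 0 and return them in chronological order.

    render_pos and render_mov are hypothetical stand-ins for CN-Pos and CN-Mov.
    """
    T = len(trajectories) - 1
    # CN-Pos renders the last frame from position markers alone.
    frames = [Frame(T, render_pos(trajectories[T]), trajectories[T])]
    # CN-Mov renders each earlier frame from the frame generated just before it.
    for t in range(T - 1, -1, -1):
        prev = frames[-1]
        image = render_mov(prev.image, prev.positions, trajectories[t])
        frames.append(Frame(t, image, trajectories[t]))
    frames.reverse()  # restore chronological order 0..T
    return frames
```

Running the loop backward means CN-Pos is called only once per sequence, and every other frame is conditioned on one that has already been generated, which is how repeated CN-Mov calls can extend a sequence to any desired length.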

Abstract

Cell tracking remains a pivotal yet challenging task in biomedical research. The full potential of deep learning for this purpose often remains untapped because comprehensive and varied training data sets are scarce. In this paper, we present SynCellFactory, a generative data augmentation method for cell videos. At the heart of SynCellFactory lies the ControlNet architecture, which has been fine-tuned to synthesize cell imagery with photorealistic accuracy in both style and motion patterns. This technique enables the creation of synthetic yet realistic cell videos that mirror the complexity of authentic microscopy time-lapses. Our experiments demonstrate that SynCellFactory boosts the performance of well-established deep learning models for cell tracking, particularly when the original training data is sparse.
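The fine-tuning objective summarized in the Figure 3 caption is the standard noise-prediction loss of latent diffusion: a clean embedding $z_0$ is noised to $z_t = \sqrt{\bar\alpha_t}\,z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, and the network is trained to recover $\epsilon$ under a mean squared error. The pure-Python sketch below is illustrative only. It assumes a DDPM-style linear $\beta$ schedule (Stable Diffusion's own schedule differs), flattens the $4\times64\times64$ latent into a list, and treats `eps_model` as a hypothetical stand-in for the UNet together with its ControlNet branch, which receives the encoded conditioning `cond`.

```python
import math
import random
from typing import Callable, List

Latent = List[float]  # flattened stand-in for a 4x64x64 embedding


def alpha_bar(t: int, num_steps: int = 1000,
              beta_min: float = 1e-4, beta_max: float = 0.02) -> float:
    """Cumulative product of (1 - beta_s) for s <= t, linear beta schedule (assumed)."""
    prod = 1.0
    for s in range(t + 1):
        beta = beta_min + (beta_max - beta_min) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noise_prediction_loss(
    z0: Latent,
    cond: object,
    t: int,
    eps_model: Callable[[Latent, int, object], Latent],
    rng: random.Random,
) -> float:
    """MSE between the injected noise and the model's noise estimate."""
    ab = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in z0]  # Gaussian noise epsilon_N
    # Forward diffusion in closed form: z_t = sqrt(ab) * z0 + sqrt(1 - ab) * eps
    zt = [math.sqrt(ab) * z + math.sqrt(1.0 - ab) * e for z, e in zip(z0, eps)]
    # Hypothetical network: UNet plus ControlNet branch fed with cond
    eps_hat = eps_model(zt, t, cond)
    return sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(eps)
```

A model that inverted the forward process exactly would drive this loss to zero. During fine-tuning, only the blocks drawn with dotted outlines in Figure 3 would be updated against it.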

Paper Structure

This paper contains 26 sections, 9 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Showcase of real (top row) and synthetic (other rows) images generated using SynCellFactory. The training data sets are a subset of the 2D Cell Tracking Challenge [CTC_old, CTC_23] and provide a broad spectrum of cell lines and microscopy modalities.
  • Figure 2: SynCellFactory is a data augmentation pipeline designed to create unlimited high-quality synthetic raw video data and corresponding pseudo ground truth. It trains three key components on a small, possibly sparsely labeled data set: the Positional ControlNet (CN-Pos), the Movement ControlNet (CN-Mov), and a 2D movement engine for realistic simulations. The process runs in reverse: the motion model first generates a conditioning image at time $T$ for CN-Pos. This image marks the expected cell centers with colored dots, where each color signifies a cell state in the mitotic cycle: green during interphase and blue during the different phases of cell division. CN-Pos then uses this information to generate a realistic frame for time $T$. Next, CN-Mov produces the next frame to be generated, at time $T-1$, conditioned on an RGB image that combines the previously generated frame (in the red channel) with the projected positions and movement patterns (in the green and blue channels). These patterns, derived from the motion model, represent each cell's trajectory from its current to its anticipated next position as a line connecting the two. By applying CN-Mov iteratively, SynCellFactory can efficiently produce time-lapse sequences of any desired length, suitable for training deep learning pipelines for cell tracking.
  • Figure 3: Train-time computation flow in the ControlNet (CNet). First, the source image $I_\text{tgt}$ and the conditioning input $c_{\text{img}}$ are transformed into $4\times64\times64$ embeddings by two convolutional encoders: $\mathcal{E}_{\text{SD}}$ for the source and $\mathcal{E}_{\text{CNet}}$ for the conditioning. In the Stable Diffusion training routine, Gaussian noise is incrementally added to the source embedding in a forward diffusion process. A UNet encoder-decoder then attempts to estimate and revert this noise perturbation, trained with the mean squared error loss between the input noise $\epsilon_{\mathcal{N}}$ and the decoder output $\epsilon_{\Theta}$. Unique to the ControlNet architecture, the additional image conditioning is integrated via an auxiliary branch. In the diagram, neural network blocks with solid outlines have fixed parameters during training, while those with dotted outlines are fine-tuned. The same architecture is used in both CN-Pos and CN-Mov.
  • Figure 4: Quantitative results according to the Tracking Accuracy Measure TRA (higher is better). We trained the EmbedTrack model without data augmentation (black square) and with different mixing ratios $\alpha$ between real and synthetic training data. Although SynCellFactory augmentation improves the TRA score on all but one of the tested data sets, the choice of $\alpha$ is critical for the benchmarked model. Error bars indicate the standard deviation over three runs.
  • Figure 5: Comparison of real (top) and simulated (bottom) mitosis. Shown is the cell division at time $t$, plus the three frames before and after it. Key differences: the simulated cell is brighter and lacks the gradual brightness change that the real cell shows before the split. However, the model accurately simulates cell contraction at $t-1$ and realistic artifacts at the split. After the split, the simulated daughter cells transition realistically from $t+1$ to $t+3$.
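The two conditioning formats described in the Figure 2 caption can be rasterized as follows. This is a minimal pure-Python sketch under stated assumptions: images are nested lists of RGB values in 0..255, all function names are hypothetical, and the dot radius is arbitrary. The caption only says that projected positions and movement lines share the green and blue channels, so placing next-frame positions in green and movement lines in blue is an assumption. Cell divisions (one cell mapping to two) are not handled; only cells whose id appears in both time steps get a movement line.

```python
from typing import Dict, List, Tuple

Image = List[List[List[int]]]  # H x W x 3, channel values in 0..255
RED, GREEN, BLUE = 0, 1, 2


def blank(h: int, w: int) -> Image:
    return [[[0, 0, 0] for _ in range(w)] for _ in range(h)]


def draw_dot(img: Image, x: int, y: int, channel: int, radius: int = 2) -> None:
    """Set a filled disc centered at (x, y) to 255 in one channel."""
    h, w = len(img), len(img[0])
    for yy in range(max(0, y - radius), min(h, y + radius + 1)):
        for xx in range(max(0, x - radius), min(w, x + radius + 1)):
            if (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2:
                img[yy][xx][channel] = 255


def draw_line(img: Image, p0: Tuple[int, int], p1: Tuple[int, int], channel: int) -> None:
    """Rasterize a straight segment from p0 to p1 by uniform sampling."""
    (x0, y0), (x1, y1) = p0, p1
    n = max(abs(x1 - x0), abs(y1 - y0), 1)
    for i in range(n + 1):
        x = round(x0 + (x1 - x0) * i / n)
        y = round(y0 + (y1 - y0) * i / n)
        if 0 <= y < len(img) and 0 <= x < len(img[0]):
            img[y][x][channel] = 255


def pos_condition(cells: List[Tuple[Tuple[int, int], bool]], h: int, w: int) -> Image:
    """CN-Pos input: green dot = interphase cell, blue dot = dividing cell."""
    img = blank(h, w)
    for (x, y), dividing in cells:
        draw_dot(img, x, y, BLUE if dividing else GREEN)
    return img


def mov_condition(prev_gray: List[List[int]],
                  current: Dict[int, Tuple[int, int]],
                  nxt: Dict[int, Tuple[int, int]]) -> Image:
    """CN-Mov input: previous frame in red; the green/blue split is assumed."""
    h, w = len(prev_gray), len(prev_gray[0])
    img = blank(h, w)
    for y in range(h):
        for x in range(w):
            img[y][x][RED] = prev_gray[y][x]  # previously generated frame
    for cell_id, p_next in nxt.items():
        draw_dot(img, p_next[0], p_next[1], GREEN)  # projected position (assumed channel)
        if cell_id in current:  # divisions are ignored in this sketch
            draw_line(img, current[cell_id], p_next, BLUE)  # movement line (assumed channel)
    return img
```

Packing all three signals into one RGB image lets CN-Mov take the same three-channel conditioning input as CN-Pos, consistent with the Figure 3 caption stating that both share one architecture.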