A Cross Modal Knowledge Distillation & Data Augmentation Recipe for Improving Transcriptomics Representations through Morphological Features

Ihab Bendidi, Yassir El Mesbahi, Alisandra K. Denton, Karush Suri, Kian Kenyon-Dean, Auguste Genovesio, Emmanuel Noutahi

TL;DR

This work tackles the scarcity of fully paired multimodal data by learning from weakly paired transcriptomics and microscopy samples to enrich transcriptomic representations with morphological cues. It introduces Semi-Clipped, a CLIP-inspired cross-modal distillation framework with frozen image encoders and trainable transcriptomics adapters, and PEA, a biologically grounded perturbation embedding augmentation that reuses batch-correction ideas. Across extensive out-of-distribution evaluations (HUVEC-KO, LINCS, SC-RPE1), Semi-Clipped with PEA achieves state-of-the-art Known Biological Relationship Recall while preserving transcriptomic interpretability, and ablations demonstrate the complementary, synergistic benefits of combining these approaches. The method is efficient (training on 1.3 million weakly paired samples in about 19 hours on a single H100) and yields richer, more actionable transcriptomics representations for drug discovery and cellular phenotyping.

Abstract

Understanding cellular responses to stimuli is crucial for biological discovery and drug development. Transcriptomics provides interpretable, gene-level insights, while microscopy imaging offers rich predictive features but is harder to interpret. Weakly paired datasets, where samples share biological states, enable multimodal learning but are scarce, limiting their utility for training and multimodal inference. We propose a framework to enhance transcriptomics by distilling knowledge from microscopy images. Using weakly paired data, our method aligns and binds modalities, enriching gene expression representations with morphological information. To address data scarcity, we introduce (1) Semi-Clipped, an adaptation of CLIP for cross-modal distillation using pretrained foundation models, achieving state-of-the-art results, and (2) PEA (Perturbation Embedding Augmentation), a novel augmentation technique that enhances transcriptomics data while preserving inherent biological information. These strategies improve the predictive power and retain the interpretability of transcriptomics, enabling rich unimodal representations for complex biological tasks.
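
As a rough illustration of the alignment idea described above, the following sketch computes a CLIP-style symmetric contrastive loss between transcriptomics embeddings passed through a trainable adapter and image embeddings that are treated as fixed. It is not the authors' implementation: the linear `apply_adapter` helper, the function names, the toy vectors, and the default temperature of 0.1 are all assumptions made for this example, and it uses plain Python rather than a deep learning framework.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def apply_adapter(weights, tx_features):
    # Hypothetical linear adapter: the only trainable part in this sketch.
    return [[sum(w * x for w, x in zip(row, f)) for row in weights]
            for f in tx_features]

def clip_loss(tx_emb, img_emb, temperature=0.1):
    # Symmetric InfoNCE loss; row k of tx_emb is paired with row k of img_emb.
    # img_emb stands in for the output of a frozen image encoder.
    tx = [l2_normalize(v) for v in tx_emb]
    img = [l2_normalize(v) for v in img_emb]
    n = len(tx)
    # Cosine similarities scaled by the temperature (assumed value).
    logits = [[sum(a * b for a, b in zip(t, i)) / temperature for i in img]
              for t in tx]
    # Transcriptomics -> image direction (rows of the similarity matrix).
    loss_t2i = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Image -> transcriptomics direction (columns of the similarity matrix).
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_i2t = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (loss_t2i + loss_i2t)

# Toy example: two weakly paired samples with 2-dimensional embeddings.
frozen_img = [[1.0, 0.0], [0.0, 1.0]]
tx_features = [[2.0, 0.0], [0.0, 3.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

aligned = clip_loss(apply_adapter(identity, tx_features), frozen_img)
misaligned = clip_loss(apply_adapter(swap, tx_features), frozen_img)
print(aligned < misaligned)
```

In this toy setting, an adapter that maps each transcriptomics sample onto its paired image embedding gives a lower loss than one that swaps the pairing. In a real training loop, only the adapter weights would receive gradient updates while the image embeddings stay fixed.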

Paper Structure

This paper contains 30 sections, 9 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Impact of training choices on Semi-Clipped performance for known biological relationship recall on HUVEC-KO. Finetuning or multimodal training from scratch underperforms due to limited weakly paired data, while using adapters on pretrained models significantly improves results. The best performance is achieved with Semi-Clipped: a single transcriptomic adapter aligned to frozen image representations.
  • Figure 2: Performance of the distillation and augmentation components of our approach against existing distillation methods (a) and biological data augmentation techniques (b) across five training seeds. Higher is better for all metrics. Semi-Clipped and PEA maintain interpretability and achieve the highest performance on all OOD datasets. (a) Z-scores of evaluation metrics (relationship recall and Tx preservability) are shown, with cool colors for label-based methods and warm colors for label-free approaches, without data augmentation. (b) Raw scores are shown for relationship recall and Tx preservability. The transcriptomics data augmentations MWO [Kircher2022], scVI denoising [scvilopez2018deep], MDWGAN-GP [Li2023-oa], and scGFT [Nouri2025] are applied within Semi-Clipped training. We also compare training runs that use all evaluated data augmentations simultaneously, with and without PEA, to assess its additional impact in a practical setting on both evaluation tasks.
  • Figure 3: Ablation study of hyperparameter choices (Tx Adapter learning rate, CLIP loss temperature, batch size, and training epochs) on the known relationship recall score when training Semi-Clipped on the HUVEC-KO dataset, including the selected optimal configuration (dotted vertical line). For each studied parameter, we set all other hyperparameters to their best-performing values. While performance varies with parameter changes, the method remains largely robust, showing minimal degradation and no collapse.
  • Figure 4: Venn diagrams of retrieved biological relationships for KD, SHAKE, VICReg, and Semi-Clipped (all trained with PEA) on the HUVEC-KO OOD dataset. Semi-Clipped shows the highest overlap with transcriptomics while integrating morphological insights, whereas KD and SHAKE exhibit the weakest alignment, possibly due to reliance on weak biological labels. Detailed measures of the gains and losses of each method in each modality are available in Figure 5.
  • Figure 5: Comparison of relationship gains and losses across cross-modal distillation methods shown in Figure 4. Our approach achieves the highest overall relationship recall and best preserves transcriptomic information.
  • ...and 2 more figures