STRAND: Sequence-Conditioned Transport for Single-Cell Perturbations

Boyang Fu, George Dasoulas, Sameer Gabbita, Xiang Lin, Shanghua Gao, Xiaorui Su, Soumya Ghosh, Marinka Zitnik

TL;DR

STRAND addresses the locus-resolution gap in single-cell perturbation prediction by conditioning perturbations on regulatory DNA sequence and modeling the response as a sequence-conditioned transport from control to perturbed states. It combines a DNA-based perturbation module with a control-anchored latent diffusion (I2SB Bridge) and a CLIP-alignment objective within a modular framework that supports OT-based pairing and replacement of DNA/RNA encoders. The approach yields state-of-the-art performance in low-sample and zero-shot settings, enables sequence-resolved in silico perturbation profiling, and recovers functionally relevant regulatory elements such as alternative transcription start sites. By expanding inference-time coverage from ~1.5% to ~95% of the genome and enabling locus-level perturbation predictions, STRAND has the potential to guide functional genomics studies and the design of genome-scale perturbations.
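To make the CLIP-alignment objective concrete, the following is a minimal NumPy sketch of a generic symmetric contrastive (InfoNCE) loss between perturbation embeddings and cell-response embeddings. The function names, the temperature value, and the assumption that row i of each input describes the same perturbation are illustrative; the exact loss used by STRAND is not specified in this summary.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_alignment_loss(pert_emb, cell_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched pairs.

    pert_emb: (B, D) sequence-derived perturbation embeddings (hypothetical input).
    cell_emb: (B, D) embeddings of the matching responses (hypothetical input).
    Row i of both arrays is assumed to describe the same perturbation.
    """
    # L2-normalise so the dot product is a cosine similarity.
    p = pert_emb / np.linalg.norm(pert_emb, axis=1, keepdims=True)
    c = cell_emb / np.linalg.norm(cell_emb, axis=1, keepdims=True)
    logits = p @ c.T / temperature              # (B, B) scaled similarities
    idx = np.arange(len(p))
    # Matched pairs sit on the diagonal; classify them in both directions.
    loss_p2c = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_c2p = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_p2c + loss_c2p)
```

The loss is small when each perturbation embedding is closest to its own response embedding and grows when pairs are mismatched, which is the property an alignment objective of this kind relies on.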

Abstract

Predicting how genetic perturbations change cellular state is a core problem for building controllable models of gene regulation. Perturbations targeting the same gene can produce different transcriptional responses depending on their genomic locus, including different transcription start sites and regulatory elements. Gene-level perturbation models collapse these distinct interventions into the same representation. We introduce STRAND, a generative model that predicts single-cell transcriptional responses by conditioning on regulatory DNA sequence. STRAND represents a perturbation by encoding the sequence at its genomic locus and uses this representation to parameterize a conditional transport process from control to perturbed cell states. Representing perturbations by sequence, rather than by a fixed set of gene identifiers, supports zero-shot inference at loci not seen during training and expands inference-time genomic coverage from ~1.5% for gene-level single-cell foundation models to ~95% of the genome. We evaluate STRAND on CRISPR perturbation datasets in K562, Jurkat, and RPE1 cells. STRAND improves discrimination scores by up to 33% in low-sample regimes, achieves the best average rank on unseen gene perturbation benchmarks, and improves transfer to novel cell lines by up to 0.14 in Pearson correlation. Ablations isolate the gains to sequence conditioning and transport, and case studies show that STRAND resolves functionally alternative transcription start sites missed by gene-level models.
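As an illustration of what a transport process from control to perturbed cell states can look like, the sketch below samples an intermediate latent state from a Brownian bridge pinned at a control embedding and a perturbed embedding. This is a simplified, constant-noise stand-in for a bridge model, not the I2SB formulation itself; the time orientation (control at t = 0), the `sigma` parameter, and all names are assumptions made for the example.

```python
import numpy as np

def sample_bridge_state(z_control, z_perturbed, t, sigma=1.0, rng=None):
    """Draw z_t from a Brownian bridge pinned at z_control (t=0) and z_perturbed (t=1).

    The mean is the linear interpolation of the two endpoints. The variance
    sigma**2 * t * (1 - t) vanishes at both ends, so every path starts exactly
    at the control state and ends exactly at the perturbed state.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Accept a scalar t or one t per cell; shape (B, 1) broadcasts over latent dims.
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    mean = (1.0 - t) * z_control + t * z_perturbed
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(z_control.shape)
```

In a bridge-style training loop, a network conditioned on the perturbation embedding would typically be trained to denoise such intermediate states; the control embedding anchors one end of every path, so the model learns a displacement from the control state rather than generating cells from pure noise.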

Paper Structure (93 sections, 53 equations, 9 figures, 11 tables)

Figures (9)

  • Figure 1: Overview of STRAND. STRAND conditions perturbation effects on regulatory DNA sequence to perform in silico profiling of locus-specific perturbations at nucleotide resolution. Perturbations targeting the same gene at different genomic locations ($s_p^1$, $s_p^2$) can induce distinct transcriptional responses.
  • Figure 2: STRAND architecture design. The framework integrates genomic and transcriptomic modalities to predict perturbed cell expression. The pipeline consists of three main learnable components: (1) A DNA Perturbation Module $g_{\phi}(\cdot)$ that processes representations from a frozen Borzoi model and perturbation masks to produce the perturbation embedding $\pmb{\mu}_p$; (2) A Latent Cellular Transport Module utilizing latent diffusion, $\epsilon_{\theta}(\mathbf{z}_{t}, t, \mathbf{u}_p, \mathbf{z}_c)$, to synthesize the perturbed cell embedding $\hat{\mathbf{z}}_{p}$ conditioned on the control cell embedding; and (3) A Gene Level Decoding Module that projects the synthesized latent features to the final predicted perturbed expression.
  • Figure 3: Benchmark of SOTA methods on low-sample gene perturbation. STRAND is compared against PMean (PerturbMean), STATE, Linear, and Biolord on the Combined dataset. GEARS is excluded because it does not support cross-cell-line joint training.
  • Figure 4: Ablation of the STRAND DNA Perturbation Module. Models are trained on the Combined dataset. Plots show statistics for the top-$K$ genes with the largest absolute expression changes. Shaded regions denote 95% CIs.
  • Figure 5: Ablation of the STRAND RNA Generative Transport. Generative model demonstrates better fidelity in modeling the distribution
  • ...and 4 more figures
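
The data flow among the three components named in the Figure 2 caption can be sketched at the level of array shapes. Everything below is a toy stand-in under stated assumptions: random matrices replace learned weights, masked mean pooling replaces the actual DNA Perturbation Module, random features replace the frozen Borzoi representations, and the Euler-style refinement loop is not the sampler used in the paper. Only the interfaces (which inputs each module consumes) follow the caption.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN, D_DNA, D_LAT, N_GENES = 64, 32, 16, 100   # illustrative sizes only

# Random matrices stand in for learned parameters (hypothetical).
W_phi = rng.normal(size=(D_DNA, D_LAT)) / np.sqrt(D_DNA)
W_eps = rng.normal(size=(3 * D_LAT + 1, D_LAT)) / np.sqrt(3 * D_LAT + 1)
W_dec = rng.normal(size=(D_LAT, N_GENES)) / np.sqrt(D_LAT)

def dna_perturbation_module(seq_features, mask):
    """Stand-in for g_phi: pool sequence features over the perturbed positions."""
    pooled = (seq_features * mask[:, None]).sum(axis=0) / mask.sum()
    return pooled @ W_phi                                   # (D_LAT,)

def denoiser(z_t, t, u_p, z_c):
    """Stand-in for eps_theta(z_t, t, u_p, z_c): one linear map over all inputs."""
    return np.concatenate([z_t, [t], u_p, z_c]) @ W_eps     # (D_LAT,)

def decode(z):
    """Stand-in for the gene-level decoder: latent -> per-gene expression."""
    return z @ W_dec                                        # (N_GENES,)

def predict(seq_features, mask, z_c, n_steps=10):
    u_p = dna_perturbation_module(seq_features, mask)
    z = z_c.copy()                  # start from the control cell embedding
    for i in range(n_steps):
        t = 1.0 - i / n_steps
        # Toy Euler-style refinement; an assumption, not the paper's sampler.
        z = z - denoiser(z, t, u_p, z_c) / n_steps
    return decode(z)

# Two perturbations at different positions of the same sequence window.
seq_features = rng.normal(size=(SEQ_LEN, D_DNA))  # placeholder for frozen DNA features
z_control = rng.normal(size=D_LAT)
mask_a = np.zeros(SEQ_LEN); mask_a[10:20] = 1.0
mask_b = np.zeros(SEQ_LEN); mask_b[40:50] = 1.0
pred_a = predict(seq_features, mask_a, z_control)
pred_b = predict(seq_features, mask_b, z_control)
```

Because the perturbation enters only through the mask over sequence positions, the two calls share the same control cell and sequence window yet yield different predicted expression vectors, which mirrors the locus-level distinction that Figure 1 describes.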