Latent Diffusion Models for Controllable RNA Sequence Generation

Kaixuan Huang, Yukang Yang, Kaidi Fu, Yanyi Chu, Le Cong, Mengdi Wang

TL;DR

This work presents RNAdiffusion, a latent diffusion model for generating and optimizing discrete RNA sequences of variable lengths. It outperforms baselines in balancing the trade-off between reward and structural stability, and holds potential for advancing RNA sequence-function research and therapeutic RNA design.

Abstract

This work presents RNAdiffusion, a latent diffusion model for generating and optimizing discrete RNA sequences of variable lengths. RNA is a key intermediary between DNA and protein, exhibiting high sequence diversity and complex three-dimensional structures to support a wide range of functions. We utilize pretrained BERT-type models to encode raw RNA sequences into token-level, biologically meaningful representations. A Query Transformer compresses these representations into a fixed-length set of latent vectors, and an autoregressive decoder is trained to reconstruct RNA sequences from these latent variables. We then develop a continuous diffusion model within this latent space. To enable optimization, we integrate the gradients of reward models (surrogates for RNA functional properties) into the backward diffusion process, thereby generating RNAs with high reward scores. Empirical results confirm that RNAdiffusion generates non-coding RNAs that align with natural distributions across various biological metrics. Further, we fine-tune the diffusion model on mRNA 5' untranslated regions (5'-UTRs) and optimize sequences for high translation efficiency. Our guided diffusion model effectively generates diverse 5'-UTRs with high Mean Ribosome Loading (MRL) and Translation Efficiency (TE), outperforming baselines in balancing the trade-off between reward and structural stability. Our findings hold potential for advancing RNA sequence-function research and therapeutic RNA design.
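
To make the idea of integrating reward gradients into the backward diffusion process concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a standard DDPM-style ancestral sampler with a classifier-guidance-style mean shift, and every name in it (`make_schedule`, `toy_eps_model`, `toy_reward`, `toy_reward_grad`, `guided_sample`, the linear beta schedule, and the linear reward) is hypothetical. The "score network" is a toy stand-in that is exact only when the latents are standard Gaussian, and the latent is a flat list rather than a set of vectors.

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.05):
    """Linear beta schedule (an assumption) and cumulative products of alpha = 1 - beta."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_eps_model(z, alpha_bar_t):
    """Toy noise predictor: exact when clean latents follow N(0, I).
    A trained score network would take its place in practice."""
    scale = math.sqrt(1.0 - alpha_bar_t)
    return [scale * v for v in z]


def toy_reward(z, w):
    """Hypothetical linear latent reward r(z) = <w, z>."""
    return sum(wi * zi for wi, zi in zip(w, z))


def toy_reward_grad(z, w):
    """Gradient of the linear reward with respect to z (constant, equal to w)."""
    return list(w)


def guided_sample(dim, w, guidance_scale, seed=0, num_steps=50):
    """Ancestral sampling where each reverse step's mean is shifted
    along the reward gradient (classifier-guidance-style update)."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    for t in reversed(range(num_steps)):
        beta, abar = betas[t], alpha_bars[t]
        eps_hat = toy_eps_model(z, abar)
        grad = toy_reward_grad(z, w)
        new_z = []
        for zi, ei, gi in zip(z, eps_hat, grad):
            # Unguided DDPM posterior mean.
            mean = (zi - beta / math.sqrt(1.0 - abar) * ei) / math.sqrt(1.0 - beta)
            # Guidance: shift the mean by scale * step variance * reward gradient.
            mean += guidance_scale * beta * gi
            noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the last step
            new_z.append(mean + math.sqrt(beta) * noise)
        z = new_z
    return z


if __name__ == "__main__":
    w = [0.1] * 8
    unguided = guided_sample(8, w, guidance_scale=0.0)
    guided = guided_sample(8, w, guidance_scale=5.0)
    print("reward without guidance:", toy_reward(unguided, w))
    print("reward with guidance:   ", toy_reward(guided, w))
```

With the same seed, the guided run ends at a higher reward than the unguided run, because each step adds a positive multiple of the reward gradient to the mean. In the setting the abstract describes, the resulting latent would then be passed to the decoder to obtain an RNA sequence; that step is omitted here.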

Paper Structure (32 sections, 5 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: RNAdiffusion: Latent diffusion model for RNA sequences. Three parts of RNAdiffusion: (1) RNA sequence auto-encoder, consisting of a pretrained RNA-FM model, a Querying Transformer, and a decoder, for translating between the sequence space and the latent space; (2) Guided diffusion model with a pretrained score network, for generating latent RNA embeddings under external guidance; (3) Latent reward model, trained on the latent space to predict functional properties of RNA, for computing guidance of diffusion.
  • Figure 2: Sequence length comparison between the natural ncRNA test set and generated sequences (sample size: 20000).
  • Figure 3: Generated samples from RNAdiffusion compared to natural ncRNAs and random sequences. Each set includes 9000 sequences. (a) Minimum 4-mer distances. (b) Minimum sequence Levenshtein distances. (c) G/C content ratios. (d) Minimum free energy. (e) Minimum Levenshtein distances of RNA secondary structure. (f) t-SNE visualizations of RNAs in latent embedding space.
  • Figure 4: Sequence length histogram of the natural UTR test set and generated sequences (sample size: 20000).
  • Figure 5: Pareto front curves between Minimum Free Energy (MFE) and Mean Ribosome Loading (MRL)/Translation Efficiency (TE). The curves are generated by selecting the top 10% quantile MRL/TE within sliding windows around each MFE value. Each shaded dot denotes a sequence. All the sequences are evaluated with the same validation reward models.
  • ...and 3 more figures
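
Some of the quantities named in the captions for Figures 3 and 5 can be sketched in a few lines of plain Python. The block below is illustrative only and is not the authors' evaluation code. All function names are hypothetical. The windowed selection is one possible reading of "top 10% quantile within sliding windows around each MFE value": it uses a nearest-rank quantile over a symmetric window, both of which are assumptions. Minimum free energy itself needs an RNA folding tool and is not computed here.

```python
import math


def gc_content(seq):
    """Fraction of G or C nucleotides in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def levenshtein(a, b):
    """Edit distance between two strings via the standard row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def min_levenshtein(query, references):
    """Smallest edit distance from one sequence to any reference sequence."""
    return min(levenshtein(query, ref) for ref in references)


def windowed_top_quantile(points, half_width=1.0, q=0.9):
    """For each (mfe, reward) point, return (mfe, q-quantile of the rewards
    of all points whose MFE lies within half_width of it)."""
    curve = []
    for mfe, _ in sorted(points):
        rewards = sorted(r for m, r in points if abs(m - mfe) <= half_width)
        rank = max(0, math.ceil(q * len(rewards)) - 1)  # nearest-rank quantile
        curve.append((mfe, rewards[rank]))
    return curve


if __name__ == "__main__":
    print(gc_content("GGCA"))
    print(min_levenshtein("GAUC", ["AAAA", "GAUG"]))
    print(windowed_top_quantile([(-10.0, 1.0), (-10.5, 3.0), (-2.0, 5.0)]))
```

The same `levenshtein` function works on dot-bracket secondary-structure strings, which is one way a structure-level distance like the one in Figure 3(e) could be computed once structures have been predicted.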