RiboGen: RNA Sequence and Structure Co-Generation with Equivariant MultiFlow

Dana Rubin, Allan dos Santos Costa, Manvitha Ponnapati, Joseph Jacobson

TL;DR

The paper addresses the challenge of jointly generating RNA sequence and all-atom 3D structure. It introduces RiboGen, a MultiFlow framework that combines continuous Flow Matching for coordinates with Discrete Flow Matching for sequences, implemented with Euclidean-equivariant networks to model 3D geometry. Empirical results show chemically valid backbone and base geometry, competitive self-consistency (TM-score) across sequence lengths of 40 to 150 nucleotides, and efficient one-shot co-generation relative to prior methods. The work highlights the potential of sequence-structure co-generation to accelerate RNA design and optimization tasks.

Abstract

Ribonucleic acid (RNA) plays fundamental roles in biological systems, from carrying genetic information to performing enzymatic functions. Understanding and designing RNA can enable novel therapeutic applications and biotechnological innovation. To enhance RNA design, in this paper we introduce RiboGen, the first deep learning model to simultaneously generate RNA sequence and all-atom 3D structure. RiboGen combines standard Flow Matching with Discrete Flow Matching in a multimodal data representation. RiboGen is based on Euclidean-equivariant neural networks for efficiently processing and learning three-dimensional geometry. Our experiments show that RiboGen can efficiently generate chemically plausible and self-consistent RNA samples, suggesting that co-generation of sequence and structure is a competitive approach for modeling RNA.
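
To make the pairing of continuous and discrete Flow Matching concrete, here is a minimal sketch of how one training example might be noised in both modalities at once. It is an illustration under stated assumptions, not the authors' implementation: the linear interpolation path, the Gaussian prior, the mask-token corruption, and the names `noise_sample` and `MASK` are all hypothetical choices. The only convention taken from the paper is that the noised input is indexed by a time $t$ and the model predicts the clean sample at $t = 1$.

```python
import numpy as np

NUCLEOTIDES = "ACGU"
MASK = len(NUCLEOTIDES)  # hypothetical extra token index used as the discrete noise state

def noise_sample(coords, seq, t, rng):
    """Corrupt a clean (coords, seq) pair to time t in [0, 1]; t = 1 is clean data."""
    # Continuous part (assumed): linear path between Gaussian noise x0 and data x1.
    x0 = rng.standard_normal(coords.shape)
    x_t = (1.0 - t) * x0 + t * coords
    # Regression target for the conditional velocity along that linear path.
    velocity = coords - x0
    # Discrete part (assumed): each token stays clean with probability t, else is masked.
    keep = rng.random(seq.shape) < t
    s_t = np.where(keep, seq, MASK)
    return x_t, velocity, s_t

rng = np.random.default_rng(0)
coords = rng.standard_normal((8, 3))        # toy stand-in for per-nucleotide center coordinates
seq = rng.integers(0, len(NUCLEOTIDES), 8)  # toy sequence encoded as integer tokens
x_t, velocity, s_t = noise_sample(coords, seq, t=0.5, rng=rng)
```

In this sketch a single shared `t` drives both corruptions, so a network trained on `(x_t, s_t, t)` sees sequence and geometry at matching noise levels. Whether RiboGen shares or decouples the two time variables is not stated in this summary.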

Paper Structure

This paper contains 15 sections, 7 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: RNA Sequence and Structure Co-Generation: (a) Left: traditional molecular structure showing the nucleotides with atoms and bonds. Right: each nucleotide (G, A, C, U) is represented as both a discrete sequence element (colored boxes) and an associated 3D point cloud (colored directional features) centered on the $C3'$ atom. (b) The RiboGen model architecture: the model takes a noised input of sequence and geometric features $\mathbf R_t$ and a time parameter $t$, processes them through the base network, and simultaneously predicts three components: the RNA sequence, central coordinates, and 3D features. These components are combined to produce the final RNA structure prediction $\hat{\mathbf R}_1$.
  • Figure 2: Multiflow for RNA Sequence, Backbone and Atomistic Structure: (a) Schematic representation of our Multiflow approach, showing the three dimensions: sequence, coordinates, and features. (b) Visualization of RNA structure generation across multiple time steps. (c) Visualization of the Discrete Flow Matching used for sequence prediction in the model, where each color represents a different nucleotide. (d) Final product, a complete generated RNA molecule.
  • Figure 3: RiboGen Chemical Analysis: Distribution comparison of key RNA geometric parameters between the training dataset and 50 random RiboGen samples across all lengths (5 from each length). The analyzed parameters include the alpha, beta, gamma, and chi dihedral angles and the ribose puckering phase, which are strong indicators of RNA backbone and chemical validity.
  • Figure 4: Self-consistency Visualization of RiboGen's Joint Sequence-Structure Generation Aligned with Boltz Structure: RiboGen-generated RNA structures (green) aligned with Boltz structure predictions (blue) derived from the corresponding co-generated sequences of RiboGen. Six examples across different sequence lengths demonstrate varying degrees of structural agreement. Notably, in some cases (c, f) RiboGen generates fragmented or unfolded structures, suggesting failure modes in the sampling process for long or structurally complex sequences.
  • Figure 5: Self-Consistency Evaluation: (a) RMSD and (b) TM-score between our generated structures and Boltz-1 predictions across different sequence lengths (40-150 nucleotides), showing the top 10 generated structures for each length. The TM-score ranges from 0 to 1, with higher values indicating better structural agreement, while lower RMSD values indicate better structural similarity. (c) Median TM-score of the top 10 generated structures: the plot compares the medians of RiboGen and FrameFlow across RNA sequence lengths and shows that RiboGen achieves higher TM-scores for sequences of 70-150 nucleotides, except at 120 nucleotides, where the two medians are similar. FrameFlow performs comparably on shorter sequences but loses structural accuracy as sequence length increases.
  • ...and 1 more figures
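
The self-consistency evaluation in Figure 5 reports RMSD between each generated structure and the structure predicted from its co-generated sequence. As a rough illustration of that metric, the sketch below computes RMSD after optimal rigid superposition with the standard Kabsch algorithm. It assumes both structures are given as matched `(N, 3)` coordinate arrays (for example, one atom per nucleotide); the function name `kabsch_rmsd` is hypothetical, and the paper's actual evaluation pipeline may differ.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two matched (N, 3) point sets after optimal rigid superposition."""
    # Remove translation by centering both point sets on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Cross-covariance matrix and its singular value decomposition.
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate P onto Q and measure the remaining per-point deviation.
    P_aligned = P @ R.T
    return float(np.sqrt(np.mean(np.sum((P_aligned - Q) ** 2, axis=1))))

# Toy check: a rotated and translated copy should superimpose almost exactly.
rng = np.random.default_rng(0)
P = rng.standard_normal((40, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
Q = P @ Rz.T + np.array([1.0, 2.0, 3.0])
rmsd_same = kabsch_rmsd(P, Q)                             # close to zero
rmsd_diff = kabsch_rmsd(P, rng.standard_normal((40, 3)))  # clearly positive
```

Because RMSD grows with structure size and is sensitive to local outliers, the length-normalized TM-score reported alongside it is the easier number to compare across the 40-150 nucleotide range.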