RNACG: A Universal RNA Sequence Conditional Generation model based on Flow-Matching

Letian Gao, Zhi John Lu

TL;DR

RNACG tackles universal RNA sequence design under limited structural data by unifying generation with flow matching and Dirichlet modeling. It supports multiple conditioning signals through a modular Condition Encoder and a Diffusion Transformer backbone, enabling family-specific generation, 3D inverse folding, and property prediction tasks. Across RNA families, 3D inverse folding benchmarks, and 5'UTR translation efficiency prediction, RNACG achieves competitive or state-of-the-art performance with a compact model (~4.5M parameters). This framework offers a versatile tool for RNA design with potential applications in synthetic biology, therapeutics, and vaccine development.

Abstract

RNA plays a pivotal role in diverse biological processes, ranging from gene regulation to catalysis. Recent advances in RNA design, such as RfamGen, RiboDiffusion, and RDesign, have demonstrated promising results, including successful designs of functional sequences. However, RNA design remains challenging because RNA molecules are inherently flexible and experimental data on their secondary and tertiary structures are scarce compared to proteins. These limitations highlight the need for a more universal and comprehensive approach to RNA design that integrates diverse annotation information at the sequence level. To address these challenges, we propose RNACG (RNA Conditional Generator), a universal framework for RNA sequence design based on flow matching. RNACG supports diverse conditional inputs, including structural, functional, and family-specific annotations, and offers a modular design that allows users to customize the encoding network for specific tasks. By unifying sequence generation under a single framework, RNACG enables the integration of multiple RNA design paradigms, from family-specific generation to tertiary structure inverse folding.
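
To make the idea of flow matching over nucleotide sequences more concrete, the following is a minimal sketch of how a noisy intermediate state might be sampled when each position is modeled on the probability simplex with a Dirichlet distribution. The path used here, Dir(1 + t·e_i) for a position whose true nucleotide is i, follows the general Dirichlet flow matching formulation and is an assumption; this page does not confirm that RNACG uses exactly this parameterization. The function names and the `ACGU` ordering are hypothetical.

```python
import numpy as np

ALPHABET = "ACGU"  # assumed nucleotide ordering, for illustration only


def one_hot(seq):
    """Encode an RNA sequence as an (L, 4) one-hot matrix."""
    idx = np.array([ALPHABET.index(c) for c in seq])
    return np.eye(len(ALPHABET))[idx]


def sample_dirichlet_path(seq, t, rng):
    """Sample x_t ~ Dir(1 + t * e_i) independently at each position.

    At t = 0 every position is uniform on the simplex (pure noise);
    as t grows, mass concentrates on the true nucleotide.
    """
    alpha = 1.0 + t * one_hot(seq)  # shape (L, 4)
    # Generator.dirichlet expects a 1-D alpha, so sample row by row.
    return np.stack([rng.dirichlet(a) for a in alpha])


rng = np.random.default_rng(0)
x0 = sample_dirichlet_path("GGCAU", t=0.0, rng=rng)       # noise
x_late = sample_dirichlet_path("GGCAU", t=50.0, rng=rng)  # near the data
```

In a setup of this kind, a network would be trained to predict the clean nucleotide at each position from `x_t`, `t`, and the conditioning signal, and generation would integrate from noise toward a sequence. Those steps are omitted here because their details are not given in this section.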

Paper Structure

This paper contains 20 sections, 38 equations, 4 figures, 3 tables, 3 algorithms.

Figures (4)

  • Figure 1: Overview of the RNACG workflow.
  • Figure 2: Evaluation of Sequence Generation Methods Across RNA Families. The figure compares the performance of RNACG (with and without secondary structure constraints), RfamGen, and cmemit across three RNA families (RF00001, RF00002, RF00005). Scores are plotted against time steps, demonstrating the improvement in sequence quality during generation. RNACG w/o ss (without secondary structure constraints) shows significant score improvements over time, while RNACG w/ ss (with secondary structure constraints) exhibits minimal improvement. Both RNACG and RfamGen generally outperform cmemit, highlighting the effectiveness of class embedding for family-specific sequence generation.
  • Figure 3: Evaluation of Inverse Folding Methods Across RNA Families on Recovery Rate and F1 Score. The figure compares the performance of RNACG and RiboDiffusion across three RNA families (RF00001, RF00002, RF00005). In each violin plot, the red half represents RNACG and the blue half represents RiboDiffusion. Panels show Recovery Rate and F1 Score, from left to right.
  • Figure 4: Evaluation of Sequence Generation Methods Across RNA Families Based on cmsearch. The figure compares the performance of RNACG (Inv3Dflow, which is based on 3D structure), RiboDiffusion, RfamGen, cmemit, and RNACG (Rfamflow, which is classifier-guided) across three RNA families (RF00001, RF00002, RF00005).
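
The Recovery Rate reported in Figure 3 is commonly computed in inverse folding work as the fraction of positions at which the designed sequence matches the native one. The sketch below assumes that common definition and equal-length, pre-aligned sequences; the paper's exact definition is not given in this section, and the function name is hypothetical.

```python
def recovery_rate(native, designed):
    """Fraction of positions where the designed nucleotide equals the native one.

    Assumes both sequences are non-empty, have the same length, and are
    already aligned position by position.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


# One mismatch out of five positions.
rate = recovery_rate("GGCAU", "GGCUU")
```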