Bridging Sequence-Structure Alignment in RNA Foundation Models

Heng Yang; Renzhi Chen; Ke Li

TL;DR

This work tackles the lack of sequence-structure alignment in RNA foundation models by introducing OmniGenome, a structure-contextualised Transformer that supports bidirectional Seq2Str and Str2Seq mappings. It adopts single-nucleotide (SN) level RNA tokenization and three pre-training objectives (Str2Seq, Seq2Str, MRLM), and is pre-trained on OneKP plant transcriptomes. It achieves state-of-the-art results on the RGB and PGB benchmarks and enables zero-shot secondary structure prediction (SSP) as well as efficient RNA design. Key results include solving up to 74% of the RNA design puzzles on EternaV2 and macro-F1 scores of up to 75 in zero-shot SSP, surpassing ViennaRNA and other baselines in many settings. The work also provides an open-source OmniGenome toolkit with tutorials and automated benchmarks, supporting multi-species RNA genome modelling and downstream genomic tasks.

Abstract

The alignment between RNA sequences and structures in foundation models (FMs) has yet to be thoroughly investigated. Existing FMs have struggled to establish sequence-structure alignment, hindering the free flow of genomic information between RNA sequences and structures. In this study, we introduce OmniGenome, an RNA FM trained to align RNA sequences with respect to secondary structures based on structure-contextualised modelling. The alignment enables free and bidirectional mappings between sequences and structures by utilising the flexible RNA modelling paradigm that supports versatile input and output modalities, i.e., sequence and/or structure as input/output. We implement RNA design and zero-shot secondary structure prediction as case studies to evaluate the Seq2Str and Str2Seq mapping capacity of OmniGenome. Results on the EternaV2 benchmark show that OmniGenome solved 74% of puzzles, whereas existing FMs only solved up to 3% of the puzzles due to the oversight of sequence-structure alignment. We leverage four comprehensive in-silico genome modelling benchmarks to evaluate performance across a diverse set of genome downstream tasks, where the results show that OmniGenome achieves state-of-the-art performance on RNA and DNA benchmarks, even without any training on DNA genomes.

Paper Structure (42 sections, 1 equation, 6 figures, 15 tables)

Introduction
Sequence-Structure Alignment in GFMs
Str2Seq Mapping
Seq2Str Mapping
Evaluations and Results
Open-source Toolkit and Tutorials
Methodology
RNA Tokenization for Alignment
Pre-training Objectives
Model Architecture
Pre-training Database: OneKP
Benchmark Suites
RNA Genomic Benchmark (RGB)
Plant Genomic Benchmark (PGB)
Str2Seq Modelling Case: RNA Design
...and 27 more sections

Figures (6)

Figure 1: An example of in-silico RNA folding drawn by ViennaRNA. Subfigures (a) and (c) show the same sequence with different structures. Subfigures (b) and (c) show that an identical structure can arise from different sequences.
Figure 2: An illustrative example of structure-contextualised sequence reconstruction. The top subfigure shows that the vocabulary must be expanded for structure-aware tokenization; otherwise, the structure symbols cannot be recognised and are treated as unknown ("?"). The bottom subfigure shows our structure-contextualised modelling (Str2Seq), where 'M' marks the masked tokens to be reconstructed by OmniGenome.
Figure 3: An illustrative example of RNA tokenization. The left subfigure shows that k-mers and BPE entangle the bases and fail to align the SN-level inputs and outputs. The right subfigure shows that only single-nucleotide tokenization (SNT) can achieve sequence-structure alignment, as required for tasks such as Seq2Str prediction.
Figure 4: The workflow of OmniGenome pre-training. We craft the inputs for the three pre-training objectives described in the Pre-training Objectives section. The outputs are reconstructed sequences based on the context of structure, predicted secondary structure, and unmasked sequences, respectively. The predictions of shadowed tokens are not calculated in the objective functions.
Figure 5: The genetic algorithm used for solving RNA design tasks. 'M' and 'A' are abbreviations for the mask token and the predicted bases in this mutation operation, respectively. The most effective component in this algorithm is the structure-based sequence reconstruction based on OmniGenome+.
...and 1 more figures
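
The alignment argument in the Figure 3 caption can be made concrete with a small sketch. The Python below is an illustration only and is not the OmniGenome toolkit API: the function names, the toy sequence, and its dot-bracket structure are all assumptions chosen for the example. It shows that single-nucleotide tokens stay index-aligned with a dot-bracket secondary structure, whereas non-overlapping 3-mers do not.

```python
def single_nucleotide_tokens(seq):
    """One token per base, so token i lines up with structure symbol i."""
    return list(seq)


def kmer_tokens(seq, k=3):
    """Non-overlapping k-mers: each token covers k bases at once."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def base_pairs(dot_bracket):
    """Recover (i, j) paired positions from a dot-bracket string with a stack."""
    stack, pairs = [], []
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            pairs.append((stack.pop(), j))
    return sorted(pairs)


# Toy hairpin (hypothetical, not taken from the paper).
seq = "GGGAAACCC"
structure = "(((...)))"

sn = single_nucleotide_tokens(seq)
km = kmer_tokens(seq)

# SN-level tokens: one structure label per token.
print(len(sn), len(structure))
# 3-mers: fewer tokens than structure symbols, so per-base labels cannot be assigned.
print(len(km), len(structure))
# Paired positions index directly into the SN-level token list.
print([(sn[i], sn[j]) for i, j in base_pairs(structure)])
```

With SN-level tokens, a per-token output head can emit exactly one structure symbol per input position; with k-mers, one token would have to carry several structure symbols at once, which is the entanglement the caption refers to.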
