Bridging Sequence-Structure Alignment in RNA Foundation Models
Heng Yang, Renzhi Chen, Ke Li
TL;DR
This work tackles the lack of sequence-structure alignment in RNA foundation models by introducing OmniGenome, a structure-contextualised Transformer that supports bidirectional Seq2Str and Str2Seq mappings. The model uses single-nucleotide (SN) tokenization and is pre-trained on OneKP plant transcriptomes with three objectives (Str2Seq, Seq2Str, MRLM). It achieves state-of-the-art results on the RGB and PGB benchmarks and enables zero-shot secondary structure prediction (SSP) as well as efficient RNA design. Key results include solving up to 74% of the RNA design puzzles on EternaV2 and reaching macro-F1 scores of up to 75 in zero-shot SSP, surpassing ViennaRNA and other baselines in many settings. The work also provides an open-source OmniGenome toolkit with tutorials and automated benchmarks, underscoring its practical value for multi-species RNA genome modelling and downstream genomic tasks.
Abstract
The alignment between RNA sequences and structures in foundation models (FMs) has yet to be thoroughly investigated. Existing FMs have struggled to establish sequence-structure alignment, hindering the free flow of genomic information between RNA sequences and structures. In this study, we introduce OmniGenome, an RNA FM trained to align RNA sequences with respect to secondary structures based on structure-contextualised modelling. The alignment enables free and bidirectional mappings between sequences and structures by utilising the flexible RNA modelling paradigm that supports versatile input and output modalities, i.e., sequence and/or structure as input/output. We implement RNA design and zero-shot secondary structure prediction as case studies to evaluate the Seq2Str and Str2Seq mapping capacity of OmniGenome. Results on the EternaV2 benchmark show that OmniGenome solved 74% of puzzles, whereas existing FMs only solved up to 3% of the puzzles due to the oversight of sequence-structure alignment. We leverage four comprehensive in-silico genome modelling benchmarks to evaluate performance across a diverse set of genome downstream tasks, where the results show that OmniGenome achieves state-of-the-art performance on RNA and DNA benchmarks, even without any training on DNA genomes.
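To make the Seq2Str setting more concrete, the following is a minimal sketch of single-nucleotide tokenization and of a macro-F1 score computed over dot-bracket structure labels, the kind of per-nucleotide metric used for secondary structure prediction. It is not the OmniGenome implementation: the vocabulary `VOCAB`, the helper names `tokenize_sn` and `macro_f1`, and the choice to score an absent label as 0 are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical single-nucleotide vocabulary (an assumption, not the paper's).
VOCAB: Dict[str, int] = {"A": 0, "C": 1, "G": 2, "U": 3, "N": 4}

# Dot-bracket labels: "(" and ")" mark paired bases, "." marks unpaired bases.
STRUCT_LABELS = ("(", ")", ".")


def tokenize_sn(seq: str) -> List[int]:
    """Map each nucleotide to one token id (T is read as U, unknown bases as N)."""
    rna = seq.upper().replace("T", "U")
    return [VOCAB.get(base, VOCAB["N"]) for base in rna]


def macro_f1(true_struct: str, pred_struct: str) -> float:
    """Unweighted mean of per-label F1 over the three dot-bracket labels.

    A label with no true or predicted occurrences is scored as 0
    (an assumption; other conventions exist).
    """
    if len(true_struct) != len(pred_struct):
        raise ValueError("structures must have the same length")
    pairs = list(zip(true_struct, pred_struct))
    scores = []
    for label in STRUCT_LABELS:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    sequence = "GGAAACC"
    reference = "((...))"   # small hairpin in dot-bracket notation
    predicted = "(.....)"   # hypothetical model output
    print(tokenize_sn(sequence))
    print(round(100 * macro_f1(reference, predicted), 1))
```

Because every nucleotide maps to exactly one token, the token sequence and the dot-bracket string have the same length, so structure labels line up position by position with the input. The score is multiplied by 100 in the usage example to match the percentage scale on which the macro-F1 results are quoted.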
