Beyond Sequence: Impact of Geometric Context for RNA Property Prediction

Junjie Xu, Artem Moskalev, Tommaso Mansi, Mangal Prakash, Rui Liao

TL;DR

This work addresses the challenge of predicting RNA properties by evaluating how explicit geometric context (2D and 3D) impacts model performance relative to traditional 1D sequence models. It introduces a curated suite of RNA datasets with enhanced 2D/3D annotations and a unified benchmark spanning 1D, 2D, and 3D representations, assessing accuracy, data efficiency, partial labeling, noise robustness, and generalization. Key findings show an average RMSE reduction of around $12\%$ when incorporating geometry, with 2D models especially effective in low-data and partially labeled scenarios. 3D models offer gains in data-sparse settings but suffer from receptive-field limits and noise from structure predictors, while geometry-free 1D models are more robust to sequencing noise but require $2$–$5\times$ more training data to reach geometry-aware performance. The results illuminate trade-offs between RNA representations, suggesting ensemble strategies and improved 3D architectures to harness geometric context while mitigating prediction noise. The study offers practical guidance for selecting RNA modeling approaches under real-world data constraints and paves the way for more robust multi-view RNA property prediction.

Abstract

Accurate prediction of RNA properties, such as stability and interactions, is crucial for advancing our understanding of biological processes and developing RNA-based therapeutics. RNA structures can be represented as 1D sequences, 2D topological graphs, or 3D all-atom models, each offering different insights into RNA function. Existing works predominantly focus on 1D sequence-based models, which overlook the geometric context provided by 2D and 3D geometries. This study presents the first systematic evaluation of incorporating explicit 2D and 3D geometric information into RNA property prediction, considering not only performance but also real-world challenges such as limited data availability, partial labeling, sequencing noise, and computational efficiency. To this end, we introduce a newly curated set of RNA datasets with enhanced 2D and 3D structural annotations, providing a resource for model evaluation on RNA data. Our findings reveal that models with explicit geometry encoding generally outperform sequence-based models, with an average prediction RMSE reduction of around 12% across the various RNA tasks, and that they excel in low-data and partial-labeling regimes, underscoring the value of explicitly incorporating geometric context. On the other hand, geometry-unaware sequence-based models are more robust under sequencing noise but often require around $2$–$5\times$ more training data to match the performance of geometry-aware models. Our study offers further insights into the trade-offs between different RNA representations in practical applications and addresses a significant gap in evaluating deep learning models for RNA tasks.
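
To make the headline metric concrete, the following is a minimal sketch of how an RMSE, a mean column-wise RMSE (MCRMSE, the metric named in the Figure 2 caption), and a relative error reduction could be computed. The function names and the numbers in the usage note are illustrative assumptions, not the authors' code or results.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error over paired target and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mcrmse(true_columns, pred_columns):
    """Mean column-wise RMSE: one RMSE per target property, then averaged.

    Each argument is a list of columns, one column per predicted property.
    """
    scores = [rmse(t, p) for t, p in zip(true_columns, pred_columns)]
    return sum(scores) / len(scores)


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error
```

As a hypothetical usage example, a sequence-only model with an RMSE of 0.50 and a geometry-aware model with an RMSE of 0.44 on the same task would give `relative_reduction(0.50, 0.44)`, a reduction of roughly 12%. These two error values are made up for illustration only.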

Paper Structure

This paper contains 60 sections, 11 figures, 18 tables.

Figures (11)

  • Figure 1: Overview of the study. (a) Left panel: RNA sequences represented in 1D, 2D, and 3D structures, processed by 1D sequence, 2D GNN, and 3D GNN models. Our analysis includes prediction error, robustness and generalization to sequencing noise, and performance under limited training data and partial labeling. (b) Right panel: Comparative performance of 1D, 2D, and 3D methods across experimental conditions. Histograms show RMSE performance, relative RMSE changes with increasing noise, and data requirements for optimal performance. Lower values indicate better performance in all metrics.
  • Figure 2: Performance vs. fraction of training data across various datasets. Performance improves (MCRMSE decreases) for all models as the fraction of training data increases. 2D models consistently outperform 1D models, particularly in low-data regimes, underscoring the value of structural information for generalization. Dotted, solid, and dashed lines denote 1D, 2D, and 3D methods, respectively; this convention applies to all figures in the paper.
  • Figure 3: Performance vs. partial property labels on COVID and Ribonanza datasets. 2D models consistently outperform 1D models with sparse labeling, while Transformer1D and Transformer1D2D improve rapidly with denser supervision, emphasizing the need for more labels in transformer-based models.
  • Figure 4: Visualization of 1D, 2D, and 3D structures under varying noise ratios (mutation errors during sequencing). Each column represents a different noise ratio, showcasing the impact of noise on the structures across different dimensions.
  • Figure 5: Robustness experiments. Transformer1D shows the least performance drop under increasing noise, maintaining the highest accuracy, with Transformer1D2D following closely. In contrast, 2D and 3D models, particularly ChebNet and 3D models, are more impacted by noise.
  • ...and 6 more figures
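
The sequencing-noise setting described in the Figure 4 and Figure 5 captions (mutation errors at a given noise ratio) can be sketched as random nucleotide substitutions. This is a minimal sketch under assumptions: the function name `mutate`, the substitution-only error model, and the rounding of the mutation count are hypothetical choices, and the paper's actual noise procedure may differ.

```python
import random

NUCLEOTIDES = "ACGU"


def mutate(sequence, noise_ratio, seed=0):
    """Return a copy of `sequence` with a fraction of positions substituted.

    Assumption: noise is modeled as point substitutions only (no insertions
    or deletions), applied to round(noise_ratio * len(sequence)) distinct
    positions chosen uniformly at random.
    """
    rng = random.Random(seed)
    n_mutations = round(noise_ratio * len(sequence))
    positions = rng.sample(range(len(sequence)), n_mutations)
    bases = list(sequence)
    for i in positions:
        # Always pick a different base so every chosen position changes.
        alternatives = [b for b in NUCLEOTIDES if b != bases[i]]
        bases[i] = rng.choice(alternatives)
    return "".join(bases)
```

For 2D and 3D models, the structures would presumably have to be re-predicted from the mutated sequence, which is consistent with the Figure 4 caption describing the impact of noise on structures in every dimension.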