Beyond Sequence: Impact of Geometric Context for RNA Property Prediction
Junjie Xu, Artem Moskalev, Tommaso Mansi, Mangal Prakash, Rui Liao
TL;DR
This work addresses the challenge of predicting RNA properties by evaluating how explicit geometric context (2D and 3D) affects model performance relative to traditional 1D sequence models. It introduces a curated suite of RNA datasets with enhanced 2D/3D annotations and a unified benchmark spanning 1D, 2D, and 3D representations, assessing accuracy, data efficiency, partial labeling, noise robustness, and generalization. Key findings show an average RMSE reduction of around $12\%$ when incorporating geometry, with 2D models especially effective in low-data and partially labeled scenarios. 3D models offer gains in data-sparse settings but suffer from receptive-field limits and noise from structure predictors, while geometry-free 1D models are more robust to sequencing noise but require $2$-$5\times$ more training data to reach geometry-aware performance. The results illuminate trade-offs between RNA representations, suggesting ensemble strategies and improved 3D architectures to harness geometric context while mitigating prediction noise. The study offers practical guidance for selecting RNA modeling approaches under real-world data constraints and paves the way for more robust multi-view RNA property prediction.
Abstract
Accurate prediction of RNA properties, such as stability and interactions, is crucial for advancing our understanding of biological processes and developing RNA-based therapeutics. RNA can be represented as a 1D sequence, a 2D topological graph, or a 3D all-atom model, each offering different insights into its function. Existing works predominantly focus on 1D sequence-based models, which overlook the geometric context provided by 2D and 3D structures. This study presents the first systematic evaluation of incorporating explicit 2D and 3D geometric information into RNA property prediction, considering not only performance but also real-world challenges such as limited data availability, partial labeling, sequencing noise, and computational efficiency. To this end, we introduce a newly curated set of RNA datasets with enhanced 2D and 3D structural annotations, providing a resource for model evaluation on RNA data. Our findings reveal that models with explicit geometry encoding generally outperform sequence-based models, reducing prediction RMSE by around 12% on average across the RNA tasks considered and excelling in low-data and partial-labeling regimes, underscoring the value of explicitly incorporating geometric context. On the other hand, geometry-unaware sequence-based models are more robust under sequencing noise but often require around $2$-$5\times$ more training data to match the performance of geometry-aware models. Our study offers further insights into the trade-offs between different RNA representations in practical applications and addresses a significant gap in evaluating deep learning models for RNA tasks.
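To make the headline metric concrete, the following is a minimal sketch of how a relative RMSE reduction between a sequence-only model and a geometry-aware model could be computed. The function names and all numbers are hypothetical and chosen only for illustration; the abstract does not specify how the reported average is aggregated across tasks, so nothing here reproduces the paper's results.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences of floats."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_rmse_reduction(baseline_rmse, model_rmse):
    """Fraction by which model_rmse improves on baseline_rmse (0.12 means 12%)."""
    if baseline_rmse <= 0:
        raise ValueError("baseline RMSE must be positive")
    return (baseline_rmse - model_rmse) / baseline_rmse


# Hypothetical targets and predictions (illustrative only, not from the paper).
y_true = [0.0, 1.0, 2.0, 3.0]
pred_1d = [0.5, 1.5, 2.5, 3.5]  # sequence-only model, off by 0.5 everywhere
pred_2d = [0.4, 1.4, 2.4, 3.4]  # geometry-aware model, off by 0.4 everywhere

rmse_1d = rmse(y_true, pred_1d)
rmse_2d = rmse(y_true, pred_2d)
reduction = relative_rmse_reduction(rmse_1d, rmse_2d)
print(f"1D RMSE={rmse_1d:.3f}, 2D RMSE={rmse_2d:.3f}, reduction={reduction:.1%}")
```

Under this definition, a positive value means the geometry-aware model has lower error than the sequence baseline. Averaging such per-task values is one plausible way to obtain a single summary figure.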
