Predicting Distance matrix with large language models

Jiaxing Yang

TL;DR

This work demonstrates that, using only primary sequence information, the distances between RNA bases can be accurately inferred by combining a large pretrained RNA language model with a well-trained downstream transformer.

Abstract

Structural prediction has long been considered critical in RNA research, and the success of AlphaFold2 in protein studies has drawn significant further attention to the field. While recent advances in machine learning and data accumulation have effectively addressed many biological tasks, particularly in protein-related research, RNA structure prediction remains a significant challenge due to data limitations. Obtaining RNA structural data is difficult because traditional methods such as nuclear magnetic resonance spectroscopy, X-ray crystallography, and electron microscopy are expensive and time-consuming. Although several RNA 3D structure prediction methods have been proposed, their accuracy is still limited. Predicting RNA structural information at another level, such as distance maps, therefore remains highly valuable. Distance maps provide a simplified representation of spatial constraints between nucleotides, capturing essential relationships without requiring a full 3D model. This intermediate level of structural information can guide more accurate 3D modeling and is computationally less intensive, making it a useful tool for improving structural predictions. In this work, we demonstrate that, using only primary sequence information, we can accurately infer the distances between RNA bases by combining a large pretrained RNA language model with a well-trained downstream transformer.
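
To make the notion of a distance map concrete, the sketch below computes a pairwise Euclidean distance matrix from 3D coordinates. It is only an illustration, not the paper's pipeline: the function name `distance_map`, the choice of one representative 3D point per nucleotide, and the toy coordinates are all assumptions made here.

```python
import numpy as np


def distance_map(coords):
    """Return the N x N matrix of Euclidean distances between N points.

    `coords` is an (N, 3) array holding one representative 3D position per
    nucleotide (a hypothetical choice; which atom represents a base is not
    specified here).
    """
    coords = np.asarray(coords, dtype=float)
    # Broadcasting yields an (N, N, 3) array of coordinate differences.
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Toy example: three made-up positions, not real structural data.
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [3.0, 4.0, 12.0],
])
D = distance_map(coords)
```

The resulting matrix is symmetric with a zero diagonal, so a model that predicts it only needs to get the upper triangle right.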

Paper Structure (13 sections, 4 equations, 7 figures, 3 tables)


Introduction
Methods
RNA Bidirectional Language Model
Distance Transformer
DiT Pre-Training
Distance matrix Tuning
DiT Self-Training
Results
Distance Prediction Accuracy
3D-Structure Evaluation
RNA Contact Evaluation
Large Model Hurts Performance
Conclusions

Figures (7)

Figure 1: Overview of the whole framework. The large-scale RNA Language Model phase serves as a ‘green box’ that provides the trained embedding layer. The right half presents the DiT architecture and its pre-training and self-training stages.
Figure 2: Illustration of what DiT is learning from distance data. The learned feature matrix is a decomposition $B$ of a distance matrix $X$.
Figure 3: Self-training stage, in which we iteratively generate pseudo-labels for unlabeled data and re-train the model from the distance-tuning stage on data that includes those pseudo-labeled examples.
Figure 4: Visualization of prediction results on four RNA structures: 5wt1, 4y1m, 4gxy, and 3jb9. DiT-PS can generate almost equivalent patterns on complicated structures.
Figure 5: The left part presents a scatterplot comparison of DiT-PS with other alternatives; almost all data points lie away from the diagonal, indicating that DiT-PS outperforms the alternatives by a clear margin. The right part suggests that DiT-PS performance drops as structure complexity grows, but it still outperforms the convolution-based Unet++.
...and 2 more figures
