CoPRA: Bridging Cross-domain Pretrained Sequence Models with Complex Structures for Protein-RNA Binding Affinity Prediction

Rong Han; Xiaohong Liu; Tong Pan; Jing Xu; Xiaoyu Wang; Wuyang Lan; Zhenyu Li; Zixuan Wang; Jiangning Song; Guangyu Wang; Ting Chen

TL;DR

CoPRA predicts protein-RNA binding affinity by combining pretrained protein and RNA language models with explicit complex-structure information. Its Co-Former fuses interface sequence embeddings with a structure-derived pair representation and is pre-trained with a bi-scope strategy on the PRI30k dataset using two tasks: contrastive protein-RNA interaction modeling (CPRI) and mask interface distance modeling (MIDM). The model reaches state-of-the-art performance for ΔG prediction on PRA310 and PRA201 and for mutation-effect prediction, and it benefits from scaling data and model size. The same cross-domain, structure-aware design may extend to affinity prediction and mutation analysis for other biomolecular interactions.

Abstract

Accurately measuring protein-RNA binding affinity is crucial for understanding many biological processes and for drug design. Previous computational methods for protein-RNA binding affinity prediction rely on either sequence or structure features and therefore cannot capture the binding mechanisms comprehensively. Recently emerging language models pre-trained on massive unlabeled protein and RNA sequences have shown strong representation ability for various in-domain downstream tasks, including binding site prediction. However, applying language models from different domains collaboratively to complex-level tasks remains unexplored. In this paper, we propose CoPRA to bridge pre-trained language models from different biological domains via Complex structure for Protein-RNA binding Affinity prediction. We demonstrate for the first time that language models from different biological modalities can collaborate to improve binding affinity prediction. We propose a Co-Former to combine the cross-modal sequence and structure information, and a bi-scope pre-training strategy to improve the Co-Former's interaction understanding. We also build PRA310, the largest protein-RNA binding affinity dataset, for performance evaluation, and test our model on a public dataset for mutation effect prediction. CoPRA reaches state-of-the-art performance on all the datasets. We provide extensive analyses and verify that CoPRA can (1) accurately predict protein-RNA binding affinity; (2) understand the binding affinity change caused by mutations; and (3) benefit from scaling data and model size.

Paper Structure (69 sections, 4 equations, 7 figures, 8 tables, 2 algorithms)

Introduction
Related Work
Protein-RNA Binding Affinity Prediction
Protein and RNA Language Models
Multi-Modal Learning in Language Models
Methods
CoPRA overview
Notations of the protein-RNA complex
Protein.
RNA.
Protein-RNA complex.
Protein-RNA interface representation
Interface sequence embedding.
Interface structure extraction.
Co-Former
...and 54 more sections

Figures (7)

Figure 1: CoPRA combines protein and RNA language models with structure information by pre-training on bi-scope tasks with different special embeddings. CPRI: contrastive protein-RNA interaction modeling; $\Delta$G/$\Delta\Delta$G: binding affinity/binding affinity change; MIDM: mask interface distance modeling. Dashed lines mark the downstream affinity prediction tasks.
Figure 2: Overview of CoPRA. Given a protein-RNA complex as input, the protein and RNA sequences are fed into a protein language model (PLM) and an RNA language model (RLM), respectively. The output embeddings are selected using interface information and fed into the Co-Former together with pairwise information. The Co-Former fuses the 1D and pair embeddings through structure-guided multi-head attention and outer product modules, with a task-dependent attention mask. Which of the Co-Former's output special nodes and pair embeddings are used depends on the task: two pre-training tasks and two downstream affinity tasks. CN, PN, and RN are the special nodes for the complex, protein, and RNA, respectively.
Figure 3: Ablation study of ESM-2 model size.
Figure 4: Different task masks.
Figure 5: An example of complex before and after filtering.
...and 2 more figures
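
The interface selection step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one representative 3D coordinate per protein residue and per RNA nucleotide, a distance cutoff (the 10.0 Å value is hypothetical), and a pair representation reduced to inter-chain distances. The function names `pairwise_distances`, `select_interface`, and `gather` are introduced here only for illustration.

```python
import math


def pairwise_distances(protein_xyz, rna_xyz):
    """Distance matrix D[i][j] between protein residue i and RNA nucleotide j."""
    return [[math.dist(p, r) for r in rna_xyz] for p in protein_xyz]


def select_interface(protein_xyz, rna_xyz, cutoff=10.0):
    """Indices of residues/nucleotides within `cutoff` of the partner chain.

    `cutoff` (in angstroms) is a hypothetical value chosen for illustration.
    Both coordinate lists are assumed to be non-empty.
    """
    dist = pairwise_distances(protein_xyz, rna_xyz)
    prot_idx = [i for i, row in enumerate(dist) if min(row) <= cutoff]
    rna_idx = [j for j in range(len(rna_xyz))
               if min(row[j] for row in dist) <= cutoff]
    return prot_idx, rna_idx


def gather(embeddings, indices):
    """Keep only the per-position embeddings at the selected indices."""
    return [embeddings[i] for i in indices]


# Toy complex: three protein residues, two RNA nucleotides.
protein_xyz = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
rna_xyz = [(0.0, 4.0, 0.0), (50.0, 0.0, 0.0)]

# Stand-ins for language-model outputs (one vector per position).
protein_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
rna_emb = [[1.0, 1.1], [1.2, 1.3]]

prot_idx, rna_idx = select_interface(protein_xyz, rna_xyz, cutoff=10.0)
interface_protein_emb = gather(protein_emb, prot_idx)
interface_rna_emb = gather(rna_emb, rna_idx)

# Pairwise information restricted to the interface (inter-chain distances only).
interface_pair = [[math.dist(protein_xyz[i], rna_xyz[j]) for j in rna_idx]
                  for i in prot_idx]
```

In this toy case the third residue and the second nucleotide lie far from the partner chain and are dropped, so only the interface embeddings and their distances would be passed on to a fusion module such as the Co-Former.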
