Table of Contents
Fetching ...

Fair splits flip the leaderboard: CHANRG reveals limited generalization in RNA secondary-structure prediction

Zhiyuan Chen, Zhenfeng Deng, Pan Deng, Yue Liao, Xiu Su, Peng Ye, Xihui Liu

Abstract

Accurate prediction of RNA secondary structure underpins transcriptome annotation, mechanistic analysis of non-coding RNAs, and RNA therapeutic design. Recent gains from deep learning and RNA foundation models are difficult to interpret because current benchmarks may overestimate generalization across RNA families. We present the Comprehensive Hierarchical Annotation of Non-coding RNA Groups (CHANRG), a benchmark of 170{,}083 structurally non-redundant RNAs curated from more than 10 million sequences in Rfam~15.0 using structure-aware deduplication, genome-aware split design and multiscale structural evaluation. Across 29 predictors, foundation-model methods achieved the highest held-out accuracy but lost most of that advantage out of distribution, whereas structured decoders and direct neural predictors remained markedly more robust. This gap persisted after controlling for sequence length and reflected both loss of structural coverage and incorrect higher-order wiring. Together, CHANRG and a padding-free, symmetry-aware evaluation stack provide a stricter and batch-invariant framework for developing RNA structure predictors with demonstrable out-of-distribution robustness.

Fair splits flip the leaderboard: CHANRG reveals limited generalization in RNA secondary-structure prediction

Abstract

Accurate prediction of RNA secondary structure underpins transcriptome annotation, mechanistic analysis of non-coding RNAs, and RNA therapeutic design. Recent gains from deep learning and RNA foundation models are difficult to interpret because current benchmarks may overestimate generalization across RNA families. We present the Comprehensive Hierarchical Annotation of Non-coding RNA Groups (CHANRG), a benchmark of 170{,}083 structurally non-redundant RNAs curated from more than 10 million sequences in Rfam~15.0 using structure-aware deduplication, genome-aware split design and multiscale structural evaluation. Across 29 predictors, foundation-model methods achieved the highest held-out accuracy but lost most of that advantage out of distribution, whereas structured decoders and direct neural predictors remained markedly more robust. This gap persisted after controlling for sequence length and reflected both loss of structural coverage and incorrect higher-order wiring. Together, CHANRG and a padding-free, symmetry-aware evaluation stack provide a stricter and batch-invariant framework for developing RNA structure predictors with demonstrable out-of-distribution robustness.
Paper Structure (9 sections, 5 figures)

This paper contains 9 sections, 5 figures.

Figures (5)

  • Figure 1: Overview of the CHANRG benchmark.a, Architecture--clan--family hierarchy of CHANRG with outer split occupancy across families. b, Structural similarity of CHANRG evaluation splits and widely used legacy benchmark sets relative to the CHANRG training set, showing reduced overlap under CHANRG. c, Sequence and family counts for CHANRG, bpRNA and ArchiveII-derived resources. d, Overall sequence-length distribution of CHANRG, highlighting the strongly long-tailed range of benchmarked RNAs. e, Multistage curation funnel from raw Rfam 15.0 sequences to the final structurally non-redundant benchmark. f, Biological rationale of held-out and out-of-distribution split regimes. g, First benchmark overview across representative methods, splits and structural metrics, showing that held-out leaders do not retain the same advantage across CHANRG out-of-distribution regimes.
  • Figure 2: Standard held-out leaderboards overestimate generalization.a, Test versus OOD$_{\mathrm{mean}}$ base-pair $F_1$ for the canonical 17-model cohort. Each point is one model, colored by predictor class (FM, SD, or DL). Dashed guide lines indicate equal performance and 75%, 50%, and 25% retention relative to Test. b, Heatmap of base-pair $F_1$ across Test, GenA, GenC, and GenF, together with OOD$_{\mathrm{mean}}$ and retention (OOD$_{\mathrm{mean}}$/Test), for the canonical cohort. Held-out leaders lose their advantage across CHANRG OOD regimes. c, Class-level base-pair $F_1$ on Test and OOD$_{\mathrm{mean}}$. Points denote individual models and error bars denote 95% bootstrap confidence intervals over per-model means. d, Per-sequence base-pair $F_1$ distributions across Test, GenA, GenC, and GenF for six representative methods: MXFold2 and EternaFold (SD), SPOT-RNA and BPfold (DL), and RiNALMo-Giga and ERNIE-RNA (FM). OOD$_{\mathrm{mean}}$ denotes the unweighted mean across GenA, GenC, and GenF.
  • Figure 3: The CHANRG generalization gap is not explained by sequence length or model scale.a, Split-specific sequence-length distributions (log-scale x-axis) show that GenA is longer than Test, whereas GenC and GenF are shorter, ruling out a simple "OOD = longer RNAs" explanation. b, Base-pair $F_1$ heatmap for the canonical 17-model cohort after restricting all splits to RNAs between 50 and 200 nt, with OOD$_{\mathrm{mean}}$ and retention columns. c, Class-level base-pair $F_1$ on the length-matched subset, showing that the held-out/OOD inversion persists after controlling the evaluated length range. Points denote individual models and error bars denote 95% bootstrap confidence intervals over per-model means. d, Change in base-pair $F_1$ after length matching ($F_1^{\mathrm{matched}} - F_1^{\mathrm{full}}$) for six representative methods. Length control improves several structured and direct neural baselines on GenA but does not rescue foundation-model transfer on GenC or GenF. e, Scaling within the RiNALMo-U-Net family improves Test performance much more than OOD performance for both base-pair $F_1$ and topology $F_1$. Horizontal reference lines indicate BPfold, EternaFold, and RNAfold on the corresponding metric. OOD$_{\mathrm{mean}}$ denotes the unweighted mean across GenA, GenC, and GenF.
  • Figure 4: Hierarchical evaluation reveals coverage failure and helix miswiring.a, Class-level stem $F_1$, topology $F_1$, and topology GED on Test and OOD$_{\mathrm{mean}}$. Points denote individual models and error bars denote 95% bootstrap confidence intervals over per-model means. Lower topology GED indicates better agreement. b, Precision--recall shift from Test to OOD$_{\mathrm{mean}}$ for structured decoders (SD), direct neural predictors (DL), and foundation-model predictors (FM). Gray contours indicate iso-$F_1$ values. c, Retention relative to Test for stem $F_1$, hairpin-loop $F_1$, internal-loop $F_1$, multiloop $F_1$, external-loop $F_1$, and topology $F_1$, stratified by method class. Points denote individual models. d, Motif-specific retention (OOD$_{\mathrm{mean}}$/Test) for six representative methods: MXFold2 and EternaFold (SD), SPOT-RNA and BPfold (DL), and RiNALMo-Giga and ERNIE-RNA (FM). e, Two pre-specified case studies localizing OOD errors to global loop--helix organization. Left, Case 1 (GenA, 109 nt; RF01527; LMFL01000010.1_72055-71947) shows moderate contact recovery but incorrect global architecture for RiNALMo-Giga. Right, Case 2 (GenC, 98 nt; RF03162; AE017226.1_854285-854382) shows perfect helix recovery but failed multiloop wiring. Target pairs, true positives (TP), false positives (FP), and false negatives (FN) are shown for each prediction. OOD$_{\mathrm{mean}}$ denotes the unweighted mean across GenA, GenC, and GenF.
  • Figure 5: Padding-free, symmetry-aware computation enables faithful CHANRG-scale evaluation.a, Comparison of conventional padded-tensor execution and the NestedTensor reference implementation. In conventional dense execution, shorter RNAs are embedded in a larger $L_{\max}\times L_{\max}$ contact map, so convolution mixes valid and padded regions and symmetry enforcement by taking the upper triangle discards half of the dense output. In the NestedTensor path, padded positions are excluded from the computation graph and symmetry is enforced at the output level by averaging the predicted contact map with its transpose, $o=(y+y^{\top})/2$. b, Contact-map padding ratio across real evaluation-set batch contexts from Test, GenA, GenC, and GenF, stratified by batch size. c, Flipped candidate-pair fraction under dense padded inference across batch sizes for Test, GenA, GenC, and GenF, showing that batch-context dependence persists even at batch size 2. d, Controlled systems benchmark on synthetic batches spanning representative sequence lengths. Relative to dense padded tensors, the NestedTensor implementation reduces inference latency by 3.3-fold and allocated GPU memory by 6.7-fold. Panels b and c use real evaluation-set contexts, whereas panel d uses synthetic batches to isolate systems cost.