Towards Optimal Grammars for RNA Structures

Evarista Onokpasa; Sebastian Wild; Prudence W. H. Wong

TL;DR

This work develops an automated framework that searches for stochastic context-free grammars (SCFGs) modeling RNA sequence-structure data, with the goal of joint compression and ab initio structure prediction. The framework combines an exhaustive search over small grammars in Stochastic RNA Normal Form with a random-search component. Some of the grammars found this way compress better than grammars designed by human experts. The study provides reference implementations and motivates an open contest for optimal RNA grammars. Overall, the approach advances compression-driven RNA structure modeling and points towards scalable, learning-based grammar discovery.

Abstract

In past work (Onokpasa, Wild, Wong, DCC 2023), we showed that (a) stochastic context-free grammars are the best known compressors for joint compression of RNA sequence and structure, and (b) grammars with better compression ability also perform better in ab initio structure prediction. Previous grammars were manually curated by human experts. In this work, we develop a framework for automatic and systematic search for stochastic grammars with better compression (and prediction) ability for RNA. We perform an exhaustive search of small grammars and identify grammars that surpass the performance of human-expert grammars.

Paper Structure (17 sections, 3 equations, 4 figures, 2 tables)

Introduction
Preliminaries
RNA as strings
Context-free Grammars
Stochastic Context-free Grammars
SCFG as probabilistic models
Derivations as representations
Probabilistic Parsing
SCFG-based Joint RNA Compression
Rule-probability models
Stochastic RNA Normal Form for Grammars
Methodology and Results
Exhaustive exploration
Distribution of compression ability
Random search
...and 2 more sections

Figures (4)

Figure 1: An example RNA sequence and structure. Left: schematic drawing of structure. Above: Representation as dot-bracket sequence when the backbone is "pulled straight".
Figure 2: Normalized average compressed size (in bits per base) for all grammars of the given size (#NTs (nonterminals), #rules) on a 10% sample of the "benchmark" dataset from Dowell and Eddy (2004). Each dot is one grammar; the $x$-coordinate uses the static rule-probability model, with rule counts taken from the same dataset; the $y$-coordinate uses the adaptive rule-probability model.
Figure 3: Histogram of the normalized average compressed size (in bits per base) for all grammars from Figure 2 on the 10% subsample of the benchmark dataset from Dowell and Eddy (2004), using the adaptive (left) and static (right) rule-probability models.
Figure 4: Newly identified grammars; $G^*_{k,r}$ is the best grammar with $k$ NTs (nonterminals) and $r$ rules from exhaustive search; $G^\dag_{k,r}$ is the best grammar we found with random search.
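
Figures 2 and 3 compare a static and an adaptive rule-probability model. The following is a minimal sketch of how the ideal code length of one derivation could differ between the two, not the paper's implementation. It assumes that a derivation is given as a list of (nonterminal, rule) pairs, that the static model uses relative rule frequencies, and that the adaptive model uses add-one counts updated after every rule; the function names and the toy rule labels are hypothetical.

```python
import math
from collections import Counter


def static_bits(derivation):
    """Ideal code length in bits when every rule is coded with its relative
    frequency among the rules of the same nonterminal (counts taken from the
    derivation itself; the cost of storing the counts is ignored here)."""
    counts = Counter(derivation)
    totals = Counter()
    for (lhs, _rhs), c in counts.items():
        totals[lhs] += c
    return sum(-math.log2(counts[rule] / totals[rule[0]]) for rule in derivation)


def adaptive_bits(derivation, rules_per_nonterminal):
    """Ideal code length in bits with add-one (Laplace) estimates that are
    updated after each rule, so no counts need to be known in advance."""
    counts = Counter()
    # Each rule starts with pseudo-count 1, so the initial total per
    # nonterminal equals its number of rules.
    totals = dict(rules_per_nonterminal)
    bits = 0.0
    for lhs, rhs in derivation:
        bits -= math.log2((counts[(lhs, rhs)] + 1) / totals[lhs])
        counts[(lhs, rhs)] += 1
        totals[lhs] += 1
    return bits


# Toy derivation: nonterminal "S" with two hypothetical rules "a" and "b".
toy = [("S", "a"), ("S", "a"), ("S", "b"), ("S", "a")]
print(f"static:   {static_bits(toy):.3f} bits")               # static:   3.245 bits
print(f"adaptive: {adaptive_bits(toy, {'S': 2}):.3f} bits")   # adaptive: 4.322 bits
```

Dividing either total by the number of bases in the encoded RNA gives a bits-per-base figure of the kind plotted in Figures 2 and 3. On such a short toy input the adaptive model looks worse, because the static model here gets its counts for free.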