Augmenting Biological Fitness Prediction Benchmarks with Landscapes Features from GraphFLA

Mingyu Huang, Shasha Zhou, Ke Li

TL;DR

GraphFLA introduces a scalable Python framework to augment sequence-fitness benchmarks with 20 biologically motivated landscape features that describe ruggedness, navigability, epistasis, and neutrality. By constructing landscapes from diverse mutagenesis data and compiling 155 combinatorially complete datasets, it enables landscape-aware interpretation and comparison of fitness prediction methods across proteins, RNAs, and DNAs. Empirical results show landscape topology strongly shapes model performance and justify using topology-aware benchmarks for model selection and evaluation. The approach, with open-source code and datasets, is poised to improve understanding of model limitations and guide design choices in fitness landscape modeling and directed evolution.

Abstract

Machine learning models increasingly map biological sequence-fitness landscapes to predict mutational effects. Effective evaluation of these models requires benchmarks curated from empirical data. Despite their impressive scales, existing benchmarks lack topographical information regarding the underlying fitness landscapes, which hampers interpretation and comparison of model performance beyond averaged scores. Here, we introduce GraphFLA, a Python framework that constructs and analyzes fitness landscapes from mutagenesis data in diverse modalities (e.g., DNA, RNA, protein, and beyond) with up to millions of mutants. GraphFLA calculates 20 biologically relevant features that characterize 4 fundamental aspects of landscape topography. By applying GraphFLA to over 5,300 landscapes from ProteinGym, RNAGym, and CIS-BP, we demonstrate its utility in interpreting and comparing the performance of dozens of fitness prediction models, highlighting factors influencing model accuracy and the respective advantages of different models. In addition, we release 155 combinatorially complete empirical fitness landscapes, encompassing over 2.2 million sequences across various modalities. All code and datasets are available at https://github.com/COLA-Laboratory/GraphFLA.

Paper Structure

This paper contains 42 sections, 36 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Overview of how GraphFLA contributes to the performance benchmarking of fitness prediction models. Existing biological fitness prediction benchmarks (b) are often curated from empirical fitness landscape datasets without interrogating landscape topography (a). GraphFLA constructs these landscapes and offers a comprehensive suite of features characterizing their topography (c). Such landscape features can then augment existing benchmarks (d) and thus assist performance interpretation (e, upper) and comparison (e, lower).
  • Figure 2: GraphFLA scales efficiently and captures influential landscape features for model performance. (a) Runtime (left) and peak memory usage (right) during fitness landscape construction for GraphFLA, MAGELLAN, and a community implementation [PapkouRM23], as a function of landscape size. Landscapes were generated using the NK model [Kauffman93] by varying the number of loci $N$ from $5$ to $20$ (i.e., landscape sizes from $2^5$ to $2^{20}$). Results shown are averages across $10$ replicates. (b) Distribution of $3$ representative landscape features across $155$ combinatorially complete landscapes collected in GraphFLA. (c) Distribution of model performance, measured by Spearman's $\rho$, for Evo2 predictions across the same $155$ landscapes. (d) Correlation matrix showing Spearman's $\rho$ between $20$ landscape features derived from GraphFLA and Evo2 performance across all $155$ combinatorial landscapes.
  • Figure 3: GraphFLA identifies influencing factors for model performance. For (a) our $155$ combinatorial landscapes, (b) ProteinGym, and (c) RNAGym, we plot the distribution of model (name specified in each plot) performance ($y$-axis; measured as Spearman's $\rho$) against landscape features ($x$-axis). Straight lines show a fit of the linear regression model, and shaded regions depict the $95\%$ confidence intervals. Dashed horizontal lines indicate the average performance across all landscapes.
  • Figure 4: Visualizing the distribution of model performance in landscape feature space. We map each of the $5,016$ landscapes constructed from the CIS-BP data into the space spanned by landscape features and color-code it by the performance (Spearman's $\rho$) of Evo2-7b to visualize how performance is distributed in the feature space.
  • Figure 5: GraphFLA facilitates landscape-aware model comparison. Difference in performance ($y$-axis) between $5$ pairs of baselines in ProteinGym (a, b, c) and RNAGym (d, e) is plotted against landscape features on the $x$-axis. Linear regression fit lines and $95\%$ confidence intervals are depicted.
  • ...and 13 more figures
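
The Figure 2 caption refers to synthetic landscapes generated with the NK model, where $N$ binary loci give $2^N$ genotypes. The following is a minimal sketch of such a generator, not the code used in the paper. The function name `nk_landscape`, the cyclic adjacent-neighbor interaction pattern, and the uniform random contributions are all assumptions made for illustration; the caption does not state the value of $K$ or the interaction scheme.

```python
import itertools
import random


def nk_landscape(n, k, seed=0):
    """Return {genotype: fitness} for a binary NK landscape.

    Assumption: locus i interacts with its k right-hand neighbors
    (wrapping around), and each contribution is drawn uniformly from [0, 1).
    """
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < n")
    rng = random.Random(seed)
    # One random lookup table per locus, indexed by the states of the
    # locus itself plus its k interaction partners.
    tables = [
        {bits: rng.random() for bits in itertools.product((0, 1), repeat=k + 1)}
        for _ in range(n)
    ]
    landscape = {}
    for genotype in itertools.product((0, 1), repeat=n):
        total = 0.0
        for i in range(n):
            key = tuple(genotype[(i + j) % n] for j in range(k + 1))
            total += tables[i][key]
        # Fitness is the mean contribution over all loci.
        landscape[genotype] = total / n
    return landscape


landscape = nk_landscape(n=5, k=2)
print(len(landscape))  # 32 genotypes, i.e. 2**5
```

Enumerating every genotype is only practical for small $N$; at $N = 20$ the dictionary already holds $2^{20}$ entries, which is the upper end of the range reported in the caption.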

Theorems & Definitions (16)

  • Definition C.1: Alleles and Loci
  • Example 1
  • Definition C.2: Genotype Space
  • Definition C.3: Hamming Distance
  • Definition C.4: Mutational Neighborhood
  • Definition C.5: Fitness Function
  • Definition C.6: Mutant Genotype Notation
  • Definition C.7: Selection Coefficient
  • Definition C.8: Local and Global Optima
  • Definition C.9: Adaptive Walk
  • ...and 6 more
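
The definitions listed above (Hamming distance, mutational neighborhood, local optima, adaptive walk) are given in full in the paper's appendix, which is not reproduced here. As a rough illustration, the sketch below uses the standard textbook forms of these concepts on a toy two-locus landscape. The function names, the strict-inequality criterion for a local optimum, and the greedy (steepest-ascent) walk are assumptions and may differ from the paper's exact definitions and from GraphFLA's implementation.

```python
def hamming(a, b):
    """Number of loci at which two equal-length genotypes differ."""
    if len(a) != len(b):
        raise ValueError("genotypes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def neighbors(genotype, alphabet):
    """Yield all genotypes at Hamming distance 1 (single mutants)."""
    for i, allele in enumerate(genotype):
        for alt in alphabet:
            if alt != allele:
                yield genotype[:i] + alt + genotype[i + 1:]


def local_optima(fitness, alphabet):
    """Genotypes that are strictly fitter than every measured neighbor."""
    return [
        g for g, f in fitness.items()
        if all(fitness[n] < f for n in neighbors(g, alphabet) if n in fitness)
    ]


def adaptive_walk(start, fitness, alphabet):
    """Greedy walk: move to the fittest improving neighbor until none exists."""
    path = [start]
    while True:
        current = path[-1]
        better = [
            n for n in neighbors(current, alphabet)
            if n in fitness and fitness[n] > fitness[current]
        ]
        if not better:
            return path
        path.append(max(better, key=fitness.get))


# Toy landscape with two peaks separated by a fitness valley.
toy = {"AA": 1.0, "AB": 0.5, "BA": 0.4, "BB": 1.2}
print(hamming("AA", "BB"))             # 2
print(local_optima(toy, "AB"))         # ['AA', 'BB']
print(adaptive_walk("AB", toy, "AB"))  # ['AB', 'BB']
```

In this toy example the global optimum is `BB`, while `AA` is a local optimum from which no single mutation improves fitness, which is the kind of structure that ruggedness features summarize.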