Does your model understand genes? A benchmark of gene properties for biological and text models

Yoav Kan-Tor, Michael Morris Danziger, Eden Zohar, Matan Ninio, Yishai Shimoni

TL;DR

The paper presents a gene-centric benchmark to compare diverse biological foundation models across modalities by evaluating gene embeddings through simple predictive heads on hundreds of curated gene properties. It assembles 312 tasks across five families (Genomic, Regulatory, Localization, Biological processes, Protein properties) and uses standardized input/output with 5-fold cross-validated linear or logistic predictors. Key findings show text-based and protein-language models generally outperform expression-based models on genomic/regulatory tasks, while expression-based models excel at localization; model size offers limited predictive advantage, suggesting complementary benefits from multi-modal integration. The work provides an open-source benchmarking platform to guide future AI strategies in biology and therapeutic discovery, with potential extensions to fine-tuning or QA tasks across modalities.

Abstract

The application of deep learning methods, particularly foundation models, in biological research has surged in recent years. These models can be text-based or trained on underlying biological data, especially omics data of various types. However, comparing the performance of these models consistently has proven to be a challenge due to differences in training data and downstream tasks. To tackle this problem, we developed an architecture-agnostic benchmarking approach that, instead of evaluating the models directly, leverages entity representation vectors from each model and trains simple predictive models for each benchmarking task. This ensures that all types of models are evaluated using the same input and output types. Here we focus on gene properties collected from professionally curated bioinformatics databases. These gene properties are categorized into five major groups: genomic properties, regulatory functions, localization, biological processes, and protein properties. Overall, we define hundreds of tasks based on these databases, which include binary, multi-label, and multi-class classification tasks. We apply these benchmark tasks to evaluate expression-based models, large language models, protein language models, DNA-based models, and traditional baselines. Our findings suggest that text-based models and protein language models generally outperform expression-based models in genomic properties and regulatory functions tasks, whereas expression-based models demonstrate superior performance in localization tasks. These results should aid in the development of more informed artificial intelligence strategies for biological understanding and therapeutic discovery. To ensure the reproducibility and transparency of our findings, we have made the source code and benchmark data publicly accessible for further investigation and expansion at github.com/BiomedSciAI/gene-benchmark.
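
The evaluation idea described above (take a fixed gene representation from each pretrained model, train a simple predictive model on it, and score it with cross-validation) can be illustrated with a small sketch. This is not the code from the authors' repository: the embeddings and labels are synthetic, every function name here is hypothetical, and the hand-written logistic regression is only a stand-in for whatever simple predictor the benchmark actually configures. The 5-fold setting follows the TL;DR.

```python
import math
import random


def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Batch gradient descent logistic regression (illustrative predictive head)."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - t
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def stratified_folds(y, k, seed=0):
    """Split indices into k folds while keeping the class balance in each fold."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    idx.sort(key=lambda i: y[i])  # stable sort keeps the shuffle within each class
    folds = [[] for _ in range(k)]
    for position, i in enumerate(idx):
        folds[position % k].append(i)
    return folds


def cross_validated_auc(X, y, k=5):
    """Mean held-out AUC of the predictive head over k folds."""
    aucs = []
    for held_out in stratified_folds(y, k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        w, b = fit_logistic([X[i] for i in train], [y[i] for i in train])
        scores = [b + sum(wi * xi for wi, xi in zip(w, X[i])) for i in held_out]
        aucs.append(roc_auc([y[i] for i in held_out], scores))
    return sum(aucs) / k


def make_synthetic_task(n_genes=100, dim=8, seed=1):
    """Synthetic stand-in for gene embeddings with one binary property."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n_genes)]
    X = [[rng.gauss(0, 1) + (2.0 * label if j == 0 else 0.0) for j in range(dim)]
         for label in y]
    return X, y


X, y = make_synthetic_task()
print(f"mean 5-fold AUC: {cross_validated_auc(X, y):.3f}")
```

Because the predictive head only ever sees a matrix of vectors and a label per gene, the same loop applies regardless of whether the vectors came from an expression-based model, a language model, or a protein language model. In practice a library implementation of logistic regression and cross-validation would replace the hand-written versions here.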

Paper Structure

This paper contains 22 sections, 14 figures, 9 tables.

Figures (14)

  • Figure 1: Gene Benchmark evaluation flow. Diverse pretrained models are benchmarked based on the ability of their gene representations to predict gene properties as collected in tasks. For example, CREM is a transcription factor while KEAP1 is not. Depending on the model, gene representations may be token embeddings (as in transcriptomics) or they may be constructed from a gene sequence (textual, base-pair or amino acid sequence) that is encoded by the model. These vectors are then used to train a simple configurable predictive model for the task. The model performance after cross-validation represents the score for the pretrained model on the task. The tasks are described in the section on benchmark tasks, and code to run the pipeline shown above is available at http://github.com/BiomedSciAI/gene-benchmark.
  • Figure 2: The performance of each model on the task families as measured by average area under the ROC curve. Parentheses show the corresponding standard deviation across all tasks of the same family.
  • Figure S1: The performance of each model on the task families as measured by average F1 score. Parentheses show the corresponding standard deviation across all tasks of the same family.
  • Figure S2: Mean AUC per model and task
  • Figure S3: Model performance measured by mean AUC for binary tasks derived from the multi-label task 'protein class'
  • ...and 9 more figures
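
The per-family summaries in Figure 2 and Figure S1 (an average score with the standard deviation across tasks of the same family in parentheses) amount to a simple grouping step. The sketch below shows one way such a summary could be computed. The task names and scores are made-up placeholders, not results from the paper; the function name is hypothetical; and the use of the sample standard deviation is an assumption, since the captions do not say which estimator was used.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_family(task_scores, task_family):
    """Group per-task scores by family and return {family: (mean, std)}."""
    grouped = defaultdict(list)
    for task, score in task_scores.items():
        grouped[task_family[task]].append(score)
    return {
        family: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for family, scores in grouped.items()
    }


# Placeholder values for illustration only; these are not reported results.
task_scores = {"task_a": 0.6, "task_b": 0.8, "task_c": 0.9}
task_family = {
    "task_a": "Localization",
    "task_b": "Localization",
    "task_c": "Genomic",
}

for family, (avg, sd) in summarize_by_family(task_scores, task_family).items():
    # Same "mean (std)" layout that the figure captions describe.
    print(f"{family}: {avg:.2f} ({sd:.2f})")
```

A family with a single task has no spread to estimate, so the sketch reports 0.0 for it rather than raising an error.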