
A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language

Yuchen Zhang, Ratish Kumar Chandrakant Jha, Soumya Bharadwaj, Vatsal Sanjaykumar Thakkar, Adrienne Hoarfrost, Jin Sun

TL;DR

This work introduces BioTalk, a multimodal benchmark that pairs gene DNA sequences with natural language function descriptions, to address the limits of predicting enzymatic function from DNA sequence alone. It establishes four benchmark datasets (SwissProt+TrEMBL and SwissProt-only, each in balanced and unbalanced form), evaluated with hierarchical metrics, k-NN retrieval, clustering quality, and multimodal zero- and few-shot prompts. Baseline results show that Finetuned LOLBERT outperforms other DNA encoders on unsupervised tasks, while multimodal prompts with Llama3 improve EC-number prediction over DNA-only approaches, especially in few-shot regimes. The dataset supports the development of interpretable, generalizable models that use textual knowledge of enzyme mechanisms alongside sequences, with potential impact on functional genomics and model interpretability.
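To make the idea of hierarchical evaluation concrete, here is a minimal Python sketch of one common way to score EC-number predictions level by level. EC numbers have four dot-separated levels, and a prediction can be counted as correct at level k when its first k levels match the true label. This is an illustrative assumption, not necessarily the exact metric defined by the benchmark; the function names `ec_levels`, `matching_depth`, and `accuracy_at_level` are hypothetical.

```python
from typing import List, Sequence


def ec_levels(ec: str) -> List[str]:
    """Split an EC number such as '3.1.1.4' into its four levels."""
    parts = ec.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four EC levels, got {ec!r}")
    return parts


def matching_depth(true_ec: str, pred_ec: str) -> int:
    """Count how many leading EC levels agree (0 to 4).

    A '-' placeholder (unassigned level) is treated as a mismatch;
    this is an assumption made for the sketch.
    """
    depth = 0
    for t, p in zip(ec_levels(true_ec), ec_levels(pred_ec)):
        if t != p or t == "-":
            break
        depth += 1
    return depth


def accuracy_at_level(
    true_ecs: Sequence[str], pred_ecs: Sequence[str], level: int
) -> float:
    """Fraction of predictions whose first `level` EC levels are correct."""
    if not true_ecs or len(true_ecs) != len(pred_ecs):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        matching_depth(t, p) >= level for t, p in zip(true_ecs, pred_ecs)
    )
    return hits / len(true_ecs)
```

Under this scheme, a prediction of 3.1.2.4 for a true label of 3.1.1.4 counts as correct at levels 1 and 2 but not at levels 3 or 4, which gives partial credit for getting the broad enzyme class right.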

Abstract

Predicting a gene's function from its DNA sequence is a fundamental challenge in biology. Many deep learning models have been proposed to embed DNA sequences and predict their enzymatic function, leveraging information in public databases that link DNA sequences to an enzymatic function label. However, much of the scientific community's knowledge of biological function is not represented in these categorical labels, and is instead captured in unstructured text descriptions of mechanisms, reactions, and enzyme behavior. These descriptions are often stored alongside DNA sequences in biological databases, albeit in an unstructured manner. Deep learning models that predict enzymatic function are likely to benefit from incorporating this multi-modal data, which encodes scientific knowledge of biological function. There is, however, no dataset designed for machine learning algorithms to leverage this multi-modal information. Here we propose a novel dataset and benchmark suite that enables the exploration and development of large multi-modal neural network models on gene DNA sequences and natural language descriptions of gene function. We present baseline performance on benchmarks for both unsupervised and supervised tasks; the results demonstrate the difficulty of this modeling objective, as well as the potential benefit of incorporating multi-modal data types in function prediction compared to DNA sequences alone. Our dataset is at: https://hoarfrost-lab.github.io/BioTalk/.

Paper Structure (12 sections, 2 figures, 5 tables)