Geneverse: A collection of Open-source Multimodal Large Language Models for Genomic and Proteomic Research

Tianyu Liu, Yijia Xiao, Xiao Luo, Hua Xu, W. Jim Zheng, Hongyu Zhao

TL;DR

This work proposes a collection of finetuned LLMs and multimodal LLMs (MLLMs), known as Geneverse, for three novel tasks in genomic and proteomic research, and demonstrates that adapted LLMs and MLLMs perform well and may outperform closed-source large-scale models based on evaluations focusing on both truthfulness and structural correctness.

Abstract

The applications of large language models (LLMs) are promising for biomedical and healthcare research. Despite the availability of open-source LLMs trained using a wide range of biomedical data, current research on the applications of LLMs to genomics and proteomics is still limited. To fill this gap, we propose a collection of finetuned LLMs and multimodal LLMs (MLLMs), known as Geneverse, for three novel tasks in genomic and proteomic research. The models in Geneverse are trained and evaluated on domain-specific datasets, and we use advanced parameter-efficient finetuning techniques to adapt the models for three tasks: generating descriptions of gene functions, inferring protein function from protein structure, and selecting marker genes from spatial transcriptomic data. We demonstrate that adapted LLMs and MLLMs perform well for these tasks and may outperform closed-source large-scale models based on our evaluations focusing on both truthfulness and structural correctness. All of the training strategies and base models we used are freely accessible.

Paper Structure (22 sections, 9 figures, 3 tables)

Figures (9)

  • Figure 1: The landscape of Geneverse. To generate LLMs for genomic and proteomic analysis, we incorporate the training datasets from rephrased descriptions for gene functions as well as synthetic descriptions from GPT 3.5. We then adjust the base model with different strategies and select the best candidate. To generate MLLMs for genomic and proteomic analysis, we incorporate the training datasets from known databases, including both descriptions and corresponding images. We then finetune the base model with different strategies and select the best candidate. The logo of Geneverse is generated by DALL-E.
  • Figure 1: Definition of different evaluators or scorers. For the scorer focusing on truthfulness, we evaluate the matching level of model outputs for the description of gene properties and gene functions. For the scorer focusing on structural correctness, we evaluate the correctness of the structure of model outputs by comparing them to the limitations in the prompt. The logos of scorers are generated by DALL-E.
  • Figure 2: UMAPs for the gene embeddings colored by gene functional information. Panels (a)-(c) represent the outputs of LLMs trained on datasets from NCBI+GPT 3.5, NCBI only, and GPT 3.5, respectively. We report the NMI score for each set of embeddings, listed after its source.
  • Figure 2: GOEA results. Each plot shows the top 10 pathways in one cluster, and the pathways are ranked by $-\log(\text{Adjusted P-value})$.
  • Figure 3: Results of sensitivity analysis for the training of different models. (a) The relation between the number of epochs and model performance of LLMs. (b) The relation between the cut-off length and model performance of LLMs. (c) The relation between the number of epochs and model performance of MLLMs for the protein task. (d) The relation between the number of epochs and model performance of MLLMs for the gene task.
  • ...and 4 more figures
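
The Figure 2 caption reports an NMI (normalized mutual information) score for each set of gene embeddings. As a rough illustration of what such a score measures, the sketch below compares two label assignments, for example cluster labels derived from embeddings against functional annotations. It assumes the arithmetic-mean normalization, NMI = 2·I(U;V) / (H(U) + H(V)); the summary does not say which normalization the authors used, and the function names and toy labels here are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of equal length."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumption, see lead-in)."""
    if len(a) != len(b):
        raise ValueError("label lists must have the same length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        # Both assignments put every item in one group: treat as identical.
        return 1.0
    return 2 * mutual_information(a, b) / (h_a + h_b)


# Hypothetical toy labels: clusters vs. functional categories for four genes.
clusters = [0, 0, 1, 1]
same_grouping = ["kinase", "kinase", "receptor", "receptor"]
unrelated_grouping = ["kinase", "receptor", "kinase", "receptor"]

score_same = nmi(clusters, same_grouping)
score_unrelated = nmi(clusters, unrelated_grouping)
```

A score near 1 means the clustering recovers the functional grouping, regardless of how the labels are named, while a score near 0 means the two groupings are statistically independent.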