GP-GPT: Large Language Model for Gene-Phenotype Mapping

Yanjun Lyu, Zihao Wu, Lu Zhang, Jing Zhang, Yiwei Li, Wei Ruan, Zhengliang Liu, Zeyu Zhang, Xiang Li, Rongjie Liu, Chao Huang, Wentao Li, Tianming Liu, Dajiang Zhu

TL;DR

GP-GPT is a specialized large language model for holistic gene–phenotype knowledge representation, instruction-fine-tuned in two stages on a multi-source genomics corpus. Built on the Llama family with LoRA/QLoRA adapters, GP-GPT outperforms several baselines, including GPT-4, on genetic medical QA and genomics relation determination. The study also shows that fine-tuning improves gene–phenotype embeddings and supports integration toward a holistic genomics knowledge graph, with potential applications in AI-assisted genetic disease prediction. Limitations include gaps in data coverage and benchmarking; future work envisions multi-modality integration and dementia-related genotype–phenotype discovery. Overall, GP-GPT represents a foundational step toward scalable, knowledge-aware genomics language models.

Abstract

Pre-trained large language models (LLMs) have attracted increasing attention in biomedical domains due to their success in natural language processing. However, the complex traits and heterogeneity of multi-source genomics data pose significant challenges when adapting these models to the bioinformatics and biomedical fields. To address these challenges, we present GP-GPT, the first specialized large language model for genetic-phenotype knowledge representation and genomics relation analysis. Our model is fine-tuned in two stages on a comprehensive corpus composed of over 3,000,000 terms in genomics, proteomics, and medical genetics, derived from multiple large-scale validated datasets and scientific publications. GP-GPT demonstrates proficiency in accurately retrieving medical genetics information and performing common genomics analysis tasks, such as genomics information retrieval and relationship determination. Comparative experiments across domain-specific tasks reveal that GP-GPT outperforms state-of-the-art LLMs, including Llama2, Llama3, and GPT-4. These results highlight GP-GPT's potential to enhance genetic disease relation research and facilitate accurate and efficient analysis in the fields of genomics and medical genetics. Our investigation also revealed subtle changes in the representations of bio-factor entities within GP-GPT, suggesting opportunities for applying LLMs to advance gene-phenotype research.

Paper Structure

This paper contains 35 sections, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: Integration of multiple datasets.
  • Figure 2: Overview of multi-task multi-level formats of genomics text data. The training data set can be built on intrinsic logic in multiple genomics datasets.
  • Figure 3: Tuning the model at the first stage using instruction mask training data. The bio-text has been fitted into the input format provided by the Llama model. The markers '### Instruction:', '### Input:', and '### Output:' are the field indicators inside the model input. The red words indicate the replaceable gene entities and phenotype entities.
  • Figure 4: Formatted genomics contexts of gene–protein. The red words indicate the replaceable gene entities and phenotype entities. The examples in the 'Instances' column show the available text which can be used to fit into the position of red words in the Gene-Protein contexts.
  • Figure 5: Formatted genomics contexts of gene-protein-disease/phenotype. The red words indicate the replaceable gene entities and phenotype entities. The examples in the 'Instances' column show the available text which can be used to fit into the position of red words in the Gene-Protein-Phenotype/Disease contexts.
  • ...and 5 more figures
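
The captions for Figures 3 to 5 describe text contexts with replaceable gene and phenotype entities that are fitted into an input format with '### Instruction:', '### Input:', and '### Output:' markers. A minimal sketch of how such a training record might be assembled is shown below. Only the three markers come from the captions; the template layout, the function name `build_example`, the instruction wording, and the placeholder entities `GENE_A` and `PHENOTYPE_X` are assumptions made for illustration, and the paper's actual masking scheme is not specified in this section.

```python
# Hypothetical layout: only the three "###" markers are taken from the
# Figure 3 caption; spacing and field order are assumptions.
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{context}\n\n"
    "### Output:\n{answer}"
)


def build_example(instruction, context_pattern, entities, answer):
    """Fill a context pattern with entity names, then wrap it in the template.

    `context_pattern` contains named slots such as {gene} and {phenotype},
    mirroring the replaceable (red) words described in the captions.
    """
    context = context_pattern.format(**entities)
    return TEMPLATE.format(
        instruction=instruction, context=context, answer=answer
    )


# Placeholder entities, not real gene or phenotype names.
example = build_example(
    instruction="Name the phenotype associated with the gene in the input.",
    context_pattern="The gene {gene} is associated with {phenotype}.",
    entities={"gene": "GENE_A", "phenotype": "PHENOTYPE_X"},
    answer="PHENOTYPE_X",
)
print(example)
```

Keeping the context pattern separate from the outer template means the same pattern can be reused with any entity pair drawn from an 'Instances' column, which is one plausible reading of how the formatted contexts in Figures 4 and 5 are expanded into many training samples.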