LLMs4Life: Large Language Models for Ontology Learning in Life Sciences

Nadeen Fathallah, Steffen Staab, Alsayed Algergawy

TL;DR

This work tackles the challenge of generating deep, domain-specific ontologies in life sciences with large language models by extending the NeOn-GPT pipeline through domain-driven prompt engineering and ontology reuse. Using AquaDiva as a case study, the authors demonstrate that iterative re-prompting, controlled task framing, and reuse of established ontologies (e.g., ENVO) substantially improve hierarchical depth, relational richness, and logical consistency, though full alignment with gold standards remains challenging due to token limits and domain complexity. The key contributions include a detailed prompt pipeline, a reuse-driven approach to deepen hierarchies, and an evaluation framework based on AquaDiva and ENVO with AML matching to quantify improvements. The findings indicate that LLM-based ontology learning can progress toward applicable domain-specific ontologies in complex scientific domains, with practical implications for interoperability and data integration in life sciences. They also highlight the need for semi-automatic human guidance and retrieval-augmented techniques to reach gold-standard completeness.

Abstract

Ontology learning in complex domains, such as life sciences, poses significant challenges for current Large Language Models (LLMs). Existing LLMs struggle to generate ontologies with multiple hierarchical levels, rich interconnections, and comprehensive class coverage due to constraints on the number of tokens they can generate and inadequate domain adaptation. To address these issues, we extend the NeOn-GPT pipeline for ontology learning using LLMs with advanced prompt engineering techniques and ontology reuse to enhance the generated ontologies' domain-specific reasoning and structural depth. Our work evaluates the capabilities of LLMs in ontology learning in highly specialized and complex domains such as the life sciences. To assess the logical consistency, completeness, and scalability of the generated ontologies, we use the AquaDiva ontology, developed and used in the collaborative research center AquaDiva, as a case study. Our evaluation shows the viability of LLMs for ontology learning in specialized domains, providing solutions to longstanding limitations in model performance and scalability.

Paper Structure

This paper contains 21 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Overview of our proposed methodology to extend the NeOn-GPT pipeline for more complicated domains such as life science domains. The process begins with the ontology domain, incorporating domain-specific descriptions and keywords. The methodology employs a pre-trained LLM to follow a structured sequence of steps: specification of ontology requirements, reuse of ontological knowledge resources, ontology conceptualization, implementation, and verification, producing the final ontology.
  • Figure 2: Visualization of the AquaDiva ontology generated from Experiment 1. The left side presents ontology metrics. The center panel shows a portion of the hierarchy of classes and their relationships (visualized using Protégé), while the right side features a structural network representation of the ontology generated using WebVOWL 1.1.7.
  • Figure 3: Visualization of the AquaDiva ontology generated from Experiment 2.
  • Figure 4: Visualization of the AquaDiva ontology generated from Experiment 3.
  • Figure 5: Visualization of the Habitat ontology within the AquaDiva ontology generated from Experiment 4.
  • ...and 2 more figures