Towards Ontology-Enhanced Representation Learning for Large Language Models

Francesco Ronzano; Jay Nanavati

TL;DR

The paper addresses the challenge of enriching embedding-LLMs with domain-specific ontological knowledge. It proposes ontology-driven knowledge infusion in three steps: generating synthetic concept definitions from the MONDO ontology with GPT-3.5-turbo, creating semantically related training pairs through synonym substitution, and fine-tuning embedding models with a contrastive learning objective that uses hard negatives. Evaluation on biomedical sentence similarity benchmarks (BIOSSES and SemEval) shows consistent in-domain gains across four embedding-LLMs while preserving out-of-domain performance, with larger improvements for simpler baselines such as PubMedBERT and SapBERT. The work demonstrates a scalable approach to integrating ontologies into text embeddings and offers a reproducible workflow that can extend to other ontologies and domains.

Abstract

Taking advantage of the widespread use of ontologies to organize and harmonize knowledge across several distinct domains, this paper proposes a novel approach to improve an embedding-Large Language Model (embedding-LLM) of interest by infusing the knowledge formalized by a reference ontology. Ontological knowledge infusion aims to boost the ability of the considered LLM to model the knowledge domain described by the infused ontology. The linguistic information (i.e., concept synonyms and descriptions) and structural information (i.e., is-a relations) formalized by the ontology are used to compile a comprehensive set of concept definitions, with the assistance of a powerful generative LLM (i.e., GPT-3.5-turbo). These concept definitions are then used to fine-tune the target embedding-LLM within a contrastive learning framework. To demonstrate and evaluate the proposed approach, we use the biomedical disease ontology MONDO. The results show that embedding-LLMs enhanced with ontological disease knowledge are better at evaluating the similarity of in-domain sentences from biomedical documents that mention diseases, without compromising their out-of-domain performance.
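
The abstract does not spell out the training objective, so the following is only a minimal sketch of one common contrastive formulation (an InfoNCE-style loss over cosine similarities), not the paper's exact equation. The function names, the plain-list vector representation, and the `temperature` value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for a single anchor embedding.

    `positive` is the embedding of a semantically related sentence and
    `negatives` is a list of embeddings of unrelated (or hard negative)
    sentences. The temperature default is an assumed, illustrative value.
    """
    # The positive goes first so that index 0 is the "correct" class.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))

    # Negative log-probability assigned to the positive candidate.
    return log_denominator - logits[0]
```

The loss is near zero when the anchor is much closer to its positive than to any negative, and it grows as a negative becomes more similar to the anchor than the positive is. This is why hard negatives, which sit close to the anchor, contribute the most to the training signal.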

Paper Structure (14 sections, 1 equation, 1 figure, 3 tables)

Introduction
Related work
Workflow to infuse ontological knowledge in embedding-LLMs
Fine-tuning embedding-LLMs by contrastive learning
Contrastive learning architecture and training objective
Ontology-driven creation of training samples
Infusing disease knowledge relying on MONDO ontology
Evaluation: datasets and results
Discussion
Conclusions and future work
GPT-3.5-turbo prompts to generate synthetic definitions
Fine-tuning hyper-parameters
Detailed information on evaluation datasets
Synonym filtering rules applied to MONDO ontology

Figures (1)

Figure 1: (a) structure of ontologies, generation of synthetic concept definitions; (b) creation of pairs of semantically related sentences by synonym substitution; (c) overview of the ontological knowledge infusion approach.
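
Panel (b) of the figure refers to building sentence pairs by synonym substitution. A rough sketch of that idea is given below; it is an assumption about how such pairs could be produced, not the authors' implementation. The helper name `synonym_pairs` is hypothetical, and the example definition and synonym are invented for illustration rather than taken from MONDO.

```python
import re


def synonym_pairs(definition, label, synonyms):
    """Build (original, variant) sentence pairs by swapping a concept label
    for each of its synonyms inside a definition sentence."""
    # Match the label as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(label) + r"\b", flags=re.IGNORECASE)
    pairs = []
    for synonym in synonyms:
        if synonym.lower() == label.lower():
            continue  # substituting the label with itself gives no new sentence
        # A function replacement inserts the synonym literally, so any
        # backslashes in it are not treated as regex escapes.
        variant = pattern.sub(lambda _match: synonym, definition)
        if variant != definition:
            pairs.append((definition, variant))
    return pairs


# Illustrative input only; not an actual ontology entry.
definition = "Hypertension is a condition of persistently elevated arterial blood pressure."
pairs = synonym_pairs(definition, "hypertension", ["high blood pressure", "Hypertension"])
```

Each resulting pair keeps the rest of the sentence fixed and changes only the concept mention, so the two sentences can serve as a positive pair for contrastive fine-tuning.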
