Towards Ontology-Enhanced Representation Learning for Large Language Models

Francesco Ronzano, Jay Nanavati

TL;DR

The paper addresses the challenge of enriching embedding-LLMs with domain-specific ontological knowledge. It proposes ontology-driven knowledge infusion in three steps: generating synthetic concept definitions from the MONDO ontology with GPT-3.5-turbo, creating semantically related training pairs through synonym substitution, and fine-tuning the embedding models with a contrastive learning objective that uses hard negatives. Evaluation on biomedical sentence similarity benchmarks (BIOSSES and SemEval) shows consistent in-domain gains across four embedding-LLMs while preserving out-of-domain performance, with larger improvements for simpler baselines such as PubMedBERT and SapBERT. The work demonstrates a scalable approach to integrating ontologies into text embeddings and offers a reproducible workflow that can extend to other ontologies and domains.

Abstract

Taking advantage of the widespread use of ontologies to organize and harmonize knowledge across several distinct domains, this paper proposes a novel approach to improve an embedding-Large Language Model (embedding-LLM) of interest by infusing the knowledge formalized by a reference ontology. Ontological knowledge infusion aims at boosting the ability of the considered LLM to effectively model the knowledge domain described by the infused ontology. The linguistic information (i.e., concept synonyms and descriptions) and structural information (i.e., is-a relations) formalized by the ontology are used to compile a comprehensive set of concept definitions, with the assistance of a powerful generative LLM (i.e., GPT-3.5-turbo). These concept definitions are then employed to fine-tune the target embedding-LLM using a contrastive learning framework. To demonstrate and evaluate the proposed approach, we use the biomedical disease ontology MONDO. The results show that embedding-LLMs enhanced by ontological disease knowledge exhibit an improved capability to evaluate the similarity of in-domain sentences from biomedical documents mentioning diseases, without compromising their out-of-domain performance.
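
To make the contrastive fine-tuning step more concrete, the following is a minimal sketch of an InfoNCE-style loss with hard negatives, written in plain Python over toy vectors. It is an illustration under stated assumptions, not the authors' implementation: the exact loss used in the paper is not given in this summary, and the function names (`cosine`, `info_nce_loss`), the cosine similarity measure, and the temperature value of 0.05 are all hypothetical choices made here. The anchor and positive stand for the embeddings of two definitions of the same concept (for example, related by synonym substitution), and the negatives stand for embeddings of definitions of other concepts.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one anchor (assumed form, not taken from the paper).

    The loss is the negative log-probability of the positive under a softmax
    over the similarities to the positive and to every negative.
    """
    # The positive's logit comes first, followed by one logit per negative.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


# Toy embeddings (hypothetical values, for illustration only).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]        # definition of the same concept
hard_negative = [0.7, 0.7]   # related but distinct concept
easy_negative = [-1.0, 0.0]  # unrelated concept

loss = info_nce_loss(anchor, positive, [hard_negative, easy_negative])
print(f"loss = {loss:.4f}")
```

In this sketch the hard negative contributes far more to the loss than the easy one, because its similarity to the anchor is close to that of the positive. That is the usual motivation for mining hard negatives: they provide a stronger training signal than randomly chosen, clearly unrelated examples.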

Paper Structure (14 sections, 1 equation, 1 figure, 3 tables)