OntoTune: Ontology-Driven Self-training for Aligning Large Language Models

Zhiqiang Liu, Chengtao Gan, Junjie Wang, Yichi Zhang, Zhongpu Bo, Mengshu Sun, Huajun Chen, Wen Zhang

TL;DR

OntoTune tackles the challenge of reorganizing domain knowledge in LLMs by leveraging existing ontologies through in-context learning. It identifies ontology gaps via concept-level prompts, selects inconsistent responses, and performs self-training with SFT and DPO to align the model with ontology guidance. Across hypernym discovery and medical QA, OntoTune achieves state-of-the-art results while preserving the seed model's knowledge and safety, demonstrating data-efficient domain adaptation using a modest ontology. The approach offers practical benefits for rapid, low-cost domain specialization and generalizes to multilingual and varied medical concepts, suggesting ontology-driven self-training as a viable alternative to large-scale domain corpora.

Abstract

Existing domain-specific Large Language Models (LLMs) are typically developed by fine-tuning general-purpose LLMs with large-scale domain-specific corpora. However, training on large-scale corpora often fails to effectively organize the domain knowledge of LLMs, leading to fragmented understanding. Inspired by how humans connect concepts and organize knowledge through mind maps, we aim to emulate this approach by using an ontology with hierarchical conceptual knowledge to reorganize an LLM's domain knowledge. From this perspective, we propose an ontology-driven self-training framework called OntoTune, which aims to align LLMs with an ontology through in-context learning, enabling the generation of responses guided by the ontology. We leverage in-context learning to identify whether the LLM has acquired a specific concept's ontology knowledge, and select the entries the LLM has not yet mastered as the training set to further align the LLM with the ontology. Compared to existing domain LLMs based on newly collected large-scale domain-specific corpora, OntoTune, which relies on an existing, long-developed ontology and the LLM itself, significantly reduces data maintenance costs and offers improved generalization ability. We conduct our study in the medical domain to evaluate the effectiveness of OntoTune, utilizing a standardized medical ontology, SNOMED CT, as our ontology source. Experimental results demonstrate that OntoTune achieves state-of-the-art performance on both the in-ontology task of hypernym discovery and the out-of-ontology task of medical-domain QA. Moreover, compared to the latest direct ontology injection method, TaxoLLaMA, OntoTune better preserves the original knowledge of the LLM. The code and data are available at https://github.com/zjukg/OntoTune.
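
The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Concept` fields, the prompt wording, the string-similarity check, the 0.8 threshold, and the `prompt`/`chosen`/`rejected` record layout are all assumptions made for the example. The idea shown is to query the same model with and without ontology context for each concept, and to keep only the concepts where the two responses disagree as self-training data.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List


@dataclass
class Concept:
    """Hypothetical container for one ontology entry."""
    name: str
    definition: str
    hypernyms: List[str]


def ontology_context(concept: Concept) -> str:
    """Render the concept's ontology knowledge as in-context text (assumed format)."""
    parents = ", ".join(concept.hypernyms)
    return (
        f"Ontology knowledge for {concept.name}: "
        f"definition: {concept.definition}; is a kind of: {parents}."
    )


def select_unmastered(
    concepts: List[Concept],
    generate: Callable[[str], str],
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
) -> List[Dict[str, str]]:
    """Keep concepts whose plain and ontology-guided responses are inconsistent."""
    selected = []
    for concept in concepts:
        instruction = f"Describe the medical concept '{concept.name}'."
        plain = generate(instruction)
        guided = generate(ontology_context(concept) + "\n" + instruction)
        # Character-level similarity is a stand-in for a real consistency measure.
        similarity = SequenceMatcher(None, plain, guided).ratio()
        if similarity < threshold:
            selected.append(
                {"prompt": instruction, "chosen": guided, "rejected": plain}
            )
    return selected
```

Under these assumptions, each selected record could serve as an SFT example (`prompt` paired with `chosen`) or as a preference pair for DPO (`chosen` versus `rejected`). `generate` stands for any function that maps a prompt string to the seed model's response.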

Paper Structure

This paper contains 37 sections, 7 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: A simple example illustrating how hierarchical structural knowledge in the ontology guides responses.
  • Figure 2: Overview of OntoTune which aligns LLMs with ontology through in-context learning.
  • Figure 3: Ontology-aware corpus generation templates.
  • Figure 4: The templates of TaxoLLaMA*'s instruction-tuning and hypernym discovery task.
  • Figure 5: Performance with different numbers of epochs and training samples. MedMCQA results are reported under the zero-shot setting.
  • ...and 11 more figures