Enhancing Domain-Specific Encoder Models with LLM-Generated Data: How to Leverage Ontologies, and How to Do Without Them

Marc Brinner, Tarek Al Mustafa, Sina Zarrieß

TL;DR

This work tackles the scarcity of domain data for encoder pretraining by injecting domain knowledge through ontology-informed embeddings or LLM-derived concepts. It introduces a similarity-based pretraining objective (SIM) that places concept definitions in a shared embedding space, informed by concept relatedness, and combines it with a standard masked language modeling (MLM) loss. The approach is evaluated in invasion biology on a new four-task benchmark, where ontology-derived data trained with SIM and MLM yields substantial gains (e.g., a benchmark score of ≈0.538). An automated LLM-driven pipeline approaches these results in low-resource settings and remains robust to data-induced instability. The findings highlight the complementary strengths of structured ontologies and LLM-generated data for domain adaptation, with practical potential for rapid deployment in other low-resource scientific domains.
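To make the idea of a similarity-based objective concrete, the following is a minimal sketch, not the paper's implementation. The summary above does not state the exact form of the loss, so the squared-error formulation, the weighted sum with the MLM loss, and all names (`cosine`, `sim_loss`, `combined_loss`, `weight`) are assumptions chosen for illustration. The embeddings and targets are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sim_loss(embeddings, target):
    """Mean squared gap between cosine similarity and target relatedness.

    Assumed formulation: target[i][j] is the desired similarity of the
    definitions of concepts i and j; only pairs with i < j are scored.
    """
    n = len(embeddings)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            gap = cosine(embeddings[i], embeddings[j]) - target[i][j]
            total += gap ** 2
            pairs += 1
    return total / pairs

def combined_loss(mlm_loss, embeddings, target, weight=1.0):
    """Hypothetical combination: MLM loss plus a weighted similarity term."""
    return mlm_loss + weight * sim_loss(embeddings, target)

# Toy data: concepts 0 and 2 are meant to be related, concept 1 is unrelated.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
toy_target = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
]
print(sim_loss(toy_embeddings, toy_target))
print(combined_loss(2.0, toy_embeddings, toy_target, weight=0.5))
```

In an actual training setup the embeddings would come from the encoder and the loss would be computed on tensors with gradients; the plain-Python version here only shows how relatedness targets could shape the embedding space.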

Abstract

We investigate the use of LLM-generated data for continual pretraining of encoder models in specialized domains with limited training data, using the scientific domain of invasion biology as a case study. To this end, we leverage domain-specific ontologies by enriching them with LLM-generated data and pretraining the encoder model as an ontology-informed embedding model for concept definitions. To evaluate the effectiveness of this method, we compile a benchmark specifically designed for assessing model performance in invasion biology. After demonstrating substantial improvements over standard LLM pretraining, we investigate the feasibility of applying the proposed approach to domains without comprehensive ontologies by substituting ontological concepts with concepts automatically extracted from a small corpus of scientific abstracts and establishing relationships between concepts through distributional statistics. Our results demonstrate that this automated approach achieves comparable performance using only a small set of scientific abstracts. The result is a fully automated pipeline for enhancing the domain-specific understanding of small encoder models that is especially suited to low-resource settings and achieves performance comparable to masked language modeling pretraining on much larger datasets.
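As an illustration of how relationships between extracted concepts might be derived from distributional statistics, here is a minimal sketch. The abstract does not name the statistic used, so pointwise mutual information (PMI) over abstract-level co-occurrence is an assumption, as are the function name `pmi_relatedness` and the toy concept lists, which are made-up examples and not data from the paper.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_relatedness(abstract_concepts):
    """Score concept pairs by PMI of co-occurrence within the same abstract.

    abstract_concepts: list of lists, one list of concept strings per abstract.
    Returns a dict mapping alphabetically sorted (a, b) pairs to PMI values.
    Pairs that never co-occur are omitted.
    """
    n_docs = len(abstract_concepts)
    doc_freq = Counter()   # number of abstracts mentioning each concept
    pair_freq = Counter()  # number of abstracts mentioning both concepts
    for concepts in abstract_concepts:
        unique = sorted(set(concepts))
        doc_freq.update(unique)
        pair_freq.update(combinations(unique, 2))

    scores = {}
    for (a, b), joint in pair_freq.items():
        p_joint = joint / n_docs
        p_a = doc_freq[a] / n_docs
        p_b = doc_freq[b] / n_docs
        scores[(a, b)] = math.log(p_joint / (p_a * p_b))
    return scores

# Toy corpus: each inner list stands for the concepts found in one abstract.
toy_abstracts = [
    ["invasive species", "propagule pressure"],
    ["invasive species", "propagule pressure"],
    ["invasive species", "enemy release"],
    ["biotic resistance"],
]
toy_scores = pmi_relatedness(toy_abstracts)
for pair, score in sorted(toy_scores.items()):
    print(pair, round(score, 4))
```

Scores of this kind could serve as the relatedness signal between concepts when no ontology is available; how the paper maps its statistics onto training targets is not specified in this summary.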

Paper Structure

This paper contains 31 sections, 1 equation, 1 figure, 3 tables.

Figures (1)

  • Figure 1: The Llama-3-8B-Instruct prompt for generating alternative definitions for concepts from the ontology.