Distilling Named Entity Recognition Models for Endangered Species from Large Language Models

Jesse Atuhurra, Seiveright Cargill Dujohn, Hidetaka Kamigaito, Hiroyuki Shindo, Taro Watanabe

TL;DR

This work investigates distilling knowledge from GPT-4 into compact NLP models that extract information about endangered species from scientific text. A two-stage pipeline generates synthetic NER and RE data via in-context prompts, then validates the data against external knowledge bases and through human verification, producing a gold dataset of 3.6K sentences. Fine-tuning three BERT variants on this data shows that domain-specific models, particularly PubMedBERT-large, achieve state-of-the-art NER performance (average F1 around 94.14%), often surpassing the teacher. The results highlight the viability of LLM-based data generation for biodiversity NLP and show that careful verification, coupled with knowledge distillation, can yield high-quality, domain-relevant information extraction systems with practical efficiency gains.

Abstract

Natural language processing (NLP) practitioners are leveraging large language models (LLMs) to create structured datasets from semi-structured and unstructured data sources such as patents, papers, and theses, without having domain-specific knowledge. At the same time, ecological experts are searching for a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and distilled knowledge from GPT-4 through in-context learning. Specifically, we created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data from GPT-4 for four classes of endangered species; 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. The resulting dataset contains a total of 3.6K sentences, evenly divided between 1.8K NER and 1.8K RE sentences. Because GPT-4 is resource intensive, we then used the constructed dataset to fine-tune both general BERT and domain-specific BERT variants, completing the knowledge distillation process from GPT-4 to BERT. Experiments show that our knowledge transfer approach is effective at creating an NER model suitable for detecting endangered species in text.

Paper Structure (23 sections, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Illustration of GPT-4 named entities (NE) and relations for a unique species. We created NER data for four named entities (species, habitat, feeding, breeding) and RE data with three relation classes (live_in, feed_on, breed_by).
  • Figure 2: Steps involved in the transfer of knowledge from GPT-4 (teacher) to BERT (student). When GPT-4 output was incorrect (text shown in red), humans corrected the data. We leveraged external knowledge from knowledge bases such as IUCN, Wikipedia, FishBase, and more to verify all the species' data. Lastly, we fine-tuned BERT variants.
  • Figure 3: Prompt used to generate all NER and RE data.
  • Figure 4: NER performance for each student model measured by F1-scores.
  • Figure 5: An example of an "easy" text during human evaluation. An easy text contains only one sentence.
  • ...and 1 more figures
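
To make the label sets in the Figure 1 caption concrete, the following is a minimal sketch of how a verified sentence could be turned into token-level tags for fine-tuning a BERT-style student. The four entity types and three relation names come from the caption. The BIO tagging scheme, the `to_bio` helper, the span format, and the example sentence are assumptions made for illustration and are not taken from the paper or its dataset.

```python
from typing import List, Tuple

# Entity and relation labels as named in the Figure 1 caption.
ENTITY_TYPES = ("species", "habitat", "feeding", "breeding")
RELATION_TYPES = ("live_in", "feed_on", "breed_by")

# Assumed span format: (start_token, end_token_exclusive, entity_type).
Span = Tuple[int, int, str]


def to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert token-level entity spans into BIO tags (assumed scheme)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if label not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {label}")
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span out of range: {(start, end)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end)}")
        # First token of the span gets B-, the rest get I-.
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical example sentence, not drawn from the dataset.
tokens = ["The", "axolotl", "lives", "in", "freshwater", "lakes", "."]
spans = [(1, 2, "species"), (4, 6, "habitat")]
tags = to_bio(tokens, spans)

# A relation between the two entities could be recorded as a labeled pair
# of span indices, here using the live_in class from the caption.
relations = [(0, 1, "live_in")]
assert all(r[2] in RELATION_TYPES for r in relations)
```

In this sketch, validating labels and rejecting overlapping spans at conversion time surfaces annotation errors early, in the same spirit as the verification stage the abstract describes.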