Distilling Named Entity Recognition Models for Endangered Species from Large Language Models
Jesse Atuhurra, Seiveright Cargill Dujohn, Hidetaka Kamigaito, Hiroyuki Shindo, Taro Watanabe
TL;DR
This work investigates distilling knowledge from GPT-4 into compact NLP models that extract information about endangered species from scientific text. A two-stage pipeline generates synthetic NER and RE data via in-context prompts, then validates it against external knowledge bases and through human verification, producing a gold dataset of 3.6K sentences. Fine-tuning three BERT variants on this data shows that domain-specific models, particularly PubMedBERT-large, achieve state-of-the-art NER performance (average F1 around 94.14%), often surpassing the teacher. The results support LLM-based data generation as a viable approach for biodiversity NLP and show that careful verification, coupled with knowledge distillation, can yield high-quality, domain-relevant information extraction systems that are more efficient to run than the teacher.
Abstract
Natural language processing (NLP) practitioners are leveraging large language models (LLMs) to create structured datasets from semi-structured and unstructured sources such as patents, papers, and theses, without needing domain-specific knowledge. At the same time, ecological experts are searching for a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and distilled knowledge from GPT-4 through in-context learning. We created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we used GPT-4 to generate synthetic data about four classes of endangered species, and 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. The final dataset contains 3.6K sentences, evenly divided between NER (1.8K) and RE (1.8K). Because GPT-4 is resource intensive, we then used the dataset to fine-tune both general and domain-specific BERT variants, completing the knowledge distillation from GPT-4 to compact BERT models. Experiments show that our knowledge transfer approach is effective at creating an NER model suitable for detecting endangered species in text.
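The two-stage process described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the prompt wording, the `SPECIES` and `HABITAT` labels, the `build_prompt` and `kb_supports` helpers, and the toy knowledge base are all hypothetical and are not taken from the paper. The GPT-4 call itself is omitted, and the automatic check stands in for, but does not replace, human verification.

```python
from typing import Dict, List, Set, Tuple

Entity = Tuple[str, str]  # (surface text, label); labels here are illustrative only


def build_prompt(species: str, examples: List[Tuple[str, List[Entity]]]) -> str:
    """Stage 1 (sketch): assemble a few-shot, in-context prompt for one species."""
    parts = ["Write one factual sentence about the species and list its entities."]
    for sentence, entities in examples:
        tagged = "; ".join(f"{text} -> {label}" for text, label in entities)
        parts.append(f"Sentence: {sentence}\nEntities: {tagged}")
    # The final block is left open for the LLM to complete.
    parts.append(f"Species: {species}\nSentence:")
    return "\n\n".join(parts)


def kb_supports(
    sentence: str,
    entities: List[Entity],
    kb: Dict[str, Dict[str, Set[str]]],
) -> bool:
    """Stage 2 (sketch): keep a sample only if the toy knowledge base agrees with it."""
    species = [text for text, label in entities if label == "SPECIES"]
    if len(species) != 1 or species[0] not in kb:
        return False
    facts = kb[species[0]]
    for text, label in entities:
        if text not in sentence:
            return False  # the entity span must literally occur in the sentence
        if label != "SPECIES" and text not in facts.get(label, set()):
            return False  # the claimed fact is not backed by the knowledge base
    return True


# Hypothetical usage with made-up data.
few_shot = [
    ("The giant panda lives in China.",
     [("giant panda", "SPECIES"), ("China", "HABITAT")]),
]
toy_kb = {"snow leopard": {"HABITAT": {"Central Asia"}}}

prompt = build_prompt("snow leopard", few_shot)
candidate = (
    "The snow leopard lives in Central Asia.",
    [("snow leopard", "SPECIES"), ("Central Asia", "HABITAT")],
)
keep = kb_supports(candidate[0], candidate[1], toy_kb)
```

Samples that pass such a filter would still go to human annotators before being treated as gold data and used to fine-tune the BERT variants.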
