DSG-KD: Knowledge Distillation from Domain-Specific to General Language Models

Sangyeon Cho; Jangyeong Jeon; Dongjoon Lee; Changhee Lee; Junyeong Kim

TL;DR

This study examines emergency/non-emergency classification on electronic medical record (EMR) data from pediatric emergency departments (PEDs) in Korea. It shows that specialized knowledge can be transferred effectively between models by using a domain-specific pre-trained model as the teacher and a general language model as the student.

Abstract

Fine-tuning pre-trained language models for specific downstream tasks is a common approach in natural language processing (NLP). However, acquiring domain-specific knowledge through fine-tuning alone is challenging. Traditional methods pre-train language models on vast amounts of domain-specific data before fine-tuning them for particular tasks. This study investigates emergency/non-emergency classification tasks based on electronic medical record (EMR) data obtained from pediatric emergency departments (PEDs) in Korea. Our findings reveal that existing domain-specific pre-trained language models underperform general language models when handling the N-lingual free-text data characteristic of non-English-speaking regions. To address these limitations, we propose a domain knowledge transfer methodology that leverages knowledge distillation (KD) to infuse general language models with domain-specific knowledge during fine-tuning. This study demonstrates the effective transfer of specialized knowledge between models by defining a general language model as the student model and a domain-specific pre-trained model as the teacher model. In particular, we address the complexities of EMR data obtained from PEDs in non-English-speaking regions, such as Korea, and demonstrate that the proposed method enhances classification performance in such contexts. The proposed methodology not only outperforms baseline models on Korean PED EMR data but also promises broader applicability in various professional and technical domains. In future work, we intend to extend this methodology to diverse non-English-speaking regions and additional downstream tasks, with the aim of developing advanced model architectures using state-of-the-art KD techniques. The code is available at https://github.com/JoSangYeon/DSG-KD.

Paper Structure (22 sections, 11 equations, 4 figures, 4 tables)


Introduction
Related Works
NLP for Clinical notes (EMR)
N-Lingual free-text data
Knowledge distillation
Proposed Method
Text Preprocessing & Labeling
Method
Definition of Domain Knowledge
Problem Formulation
Knowledge distillation
Prediction Loss
Experiments
Dataset
Implementation Details
...and 7 more sections

Figures (4)

Figure 1: Generalized LM performs better. Performance of each pre-trained LM on an emergency/non-emergency classification task using EMR data from Korean PEDs in terms of (a) AUROC and (b) AUPRC.
Figure 2: Model Architecture (Knowledge Transfer) - Visualization of the proposed methodology. The architecture consists of both student and teacher models based on transformers, which act as encoder blocks with multi-head attention (MHA) and feed-forward networks (FFN). The student model performs a prediction task ($\mathcal{L}_{\text{pred}}$) and receives appropriate domain knowledge from the teacher model by minimizing $\mathcal{L}_{\text{hidn}}$ and $\mathcal{L}_{\text{attn}}$. In the figure, we define "vomiting" and "fever" as domain knowledge words ($k=2$) and perform distillation by receiving appropriate representations from the hidden states and attention matrices that arise from the teacher model. Overall, the goal of the proposed architecture is to transfer the teacher model's knowledge to the student, not only in terms of classification predictions but also throughout the model's internal representation, thereby enabling the student model to make decisions based on a deeper and more nuanced understanding of the input data.
Figure 3: Qualitative analysis: Visual representation of the effectiveness of the proposed training method. S denotes the student model (Ko-BERT); the teacher model is KM-BERT. The gray dots correspond to an independently trained student model, the green dots correspond to a student trained with the proposed methodology, and the red dots correspond to the spatial coordinates of the teacher model's representation.
Figure 4: Case analysis: This figure shows calculated MWPS for correctly classified cases in the test set for each model. The proposed model (Ours) uses Ko-BERT as the student model and KM-BERT as the teacher model. By achieving the highest MWPS performance compared to all other models, our methodology demonstrates an effective understanding of medical words. Notably, the higher MWPS compared to the teacher model is particularly encouraging in terms of domain knowledge comprehension.
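
The loss structure described in the Figure 2 caption (a prediction loss plus hidden-state and attention terms restricted to domain knowledge words) can be illustrated with a small sketch. This is not the authors' implementation: the use of mean squared error for $\mathcal{L}_{\text{hidn}}$ and $\mathcal{L}_{\text{attn}}$, the binary cross-entropy form of $\mathcal{L}_{\text{pred}}$, the weights `w_hidn` and `w_attn`, and all function names are assumptions made for illustration. The sketch also assumes the student and teacher share the same tokenization and hidden size, so that rows can be compared position by position.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def masked_row_mse(student, teacher, positions):
    """Average row-wise MSE, computed only at the selected token positions."""
    if not positions:
        return 0.0
    return sum(mse(student[i], teacher[i]) for i in positions) / len(positions)


def binary_cross_entropy(prob, label, eps=1e-12):
    """Cross-entropy for one emergency (1) / non-emergency (0) prediction."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def distillation_loss(s_hidden, t_hidden, s_attn, t_attn,
                      knowledge_positions, prob, label,
                      w_hidn=1.0, w_attn=1.0):
    """Combine the three loss terms (weights are hypothetical)."""
    l_pred = binary_cross_entropy(prob, label)
    # Only rows belonging to domain knowledge words contribute.
    l_hidn = masked_row_mse(s_hidden, t_hidden, knowledge_positions)
    l_attn = masked_row_mse(s_attn, t_attn, knowledge_positions)
    total = l_pred + w_hidn * l_hidn + w_attn * l_attn
    return {"pred": l_pred, "hidn": l_hidn, "attn": l_attn, "total": total}


# Toy input: three tokens, e.g. ["child", "vomiting", "fever"], where
# positions 1 and 2 are the domain knowledge words (k = 2).
student_hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
teacher_hidden = [[5.0, 5.0], [1.0, 3.0], [2.0, 2.0]]
student_attn = [[0.2, 0.3, 0.5], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
teacher_attn = [[0.9, 0.05, 0.05], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]

losses = distillation_loss(student_hidden, teacher_hidden,
                           student_attn, teacher_attn,
                           knowledge_positions=[1, 2],
                           prob=0.5, label=1)
print(losses)
```

In this toy input the student and teacher disagree strongly at position 0, but that position is not a domain knowledge word, so it contributes nothing to the hidden-state or attention terms. Restricting the distillation signal to selected positions is the point of the sketch; how the paper actually selects those words and weights the terms is defined in its Method section, not here.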
