A Small and Fast BERT for Chinese Medical Punctuation Restoration
Tongtao Ling, Yutao Lai, Lei Chen, Shilei Huang, Yi Liu
TL;DR
This paper tackles punctuation restoration for Chinese medical transcripts by introducing a fast, lightweight pre-trained model built with a pretraining and fine-tuning pipeline. Its core contributions are a novel auxiliary pre-training task, Punctuation Mark Prediction (PMP), and supervised contrastive learning (SCL), combined with knowledge distillation (KD) to produce compact models. Reformulating punctuation restoration as a slot tagging task during fine-tuning narrows the gap between pre-training and the downstream task. Empirical results show that the PMP variants reach about 95% of the performance of a Chinese RoBERTa baseline at roughly 10% of its size, and ablations confirm that PMP, SCL, and KD each contribute. The work offers a practical, efficient solution for high-stakes clinical NLP, enabling accurate punctuation restoration in medical reports with limited resources.
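To make the slot tagging reformulation concrete, the following minimal sketch converts a punctuated sentence into per-character tags and back. It is an illustration only: the tag names, the set of punctuation marks, and the helper functions `to_slot_tags` and `restore` are assumptions made here, not the label inventory or code used in the paper.

```python
# Hypothetical label set (an assumption); the paper's tag inventory may differ.
PUNCT_TO_TAG = {",": "COMMA", "。": "PERIOD", "?": "QUESTION", "、": "PAUSE"}
TAG_TO_PUNCT = {tag: mark for mark, tag in PUNCT_TO_TAG.items()}


def to_slot_tags(text):
    """Strip punctuation and tag each character with the mark that followed it."""
    chars, tags = [], []
    for ch in text:
        if ch in PUNCT_TO_TAG:
            # Attach the mark to the preceding character; leading marks and
            # repeated marks after the same character are dropped.
            if tags and tags[-1] == "O":
                tags[-1] = PUNCT_TO_TAG[ch]
        else:
            chars.append(ch)
            tags.append("O")  # "O" means no punctuation follows this character
    return chars, tags


def restore(chars, tags):
    """Rebuild punctuated text from characters and their predicted tags."""
    return "".join(ch + TAG_TO_PUNCT.get(tag, "") for ch, tag in zip(chars, tags))


chars, tags = to_slot_tags("患者无发热,无咳嗽。")
print(list(zip(chars, tags)))
print(restore(chars, tags))
```

In this framing, a fine-tuned model would predict one tag per input character of the unpunctuated ASR output, and `restore` would turn those predictions into readable text.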
Abstract
In clinical dictation, the output of automatic speech recognition (ASR) lacks explicit punctuation marks, which may lead to misunderstanding of dictated reports. Producing a precise and understandable clinical report with ASR therefore requires automatic punctuation restoration. With a practical deployment scenario in mind, we propose a fast and light pre-trained model for Chinese medical punctuation restoration based on the 'pretraining and fine-tuning' paradigm. In this work, we distill pre-trained models by incorporating supervised contrastive learning and a novel auxiliary pre-training task (Punctuation Mark Prediction) to make them well suited for punctuation restoration. Our experiments on various distilled models show that our model achieves 95% of the performance of the state-of-the-art Chinese RoBERTa at 10% of its model size.
