Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training
Longhui Zhang, Dingkun Long, Meishan Zhang, Yanzhao Zhang, Pengjun Xie, Min Zhang
TL;DR
This work addresses the dependence of Chinese sequence labeling on accurate word boundaries. It extends BABERT with supervised lexical boundary signals in a semi-supervised framework that combines masked language modeling (MLM), boundary-aware objectives, and positive-unlabeled (PU) learning. A span-based boundary recognition task handles nested boundaries, and PU learning guides the iterative expansion of a high-quality lexicon. The authors also propose the Boundary Information Metric (BIM), which quantifies boundary awareness without task-specific fine-tuning and so enables rapid model comparison. Experiments on 13 CWS/POS/NER datasets and on CLUE tasks show that the resulting model, Semi-BABERT, consistently improves boundary encoding and overall performance, with strong benefits in low-resource and few-shot settings. Overall, the results indicate that injecting high-quality boundary information into pre-training can substantially improve Chinese language understanding, even beyond boundary-sensitive tasks.
Abstract
Chinese sequence labeling tasks rely heavily on accurate word boundary demarcation. Although current pre-trained language models (PLMs) have achieved substantial gains on these tasks, they rarely incorporate boundary information explicitly into the modeling process. An exception is BABERT, which incorporates unsupervised statistical boundary information into Chinese BERT's pre-training objectives. Building on this approach, we incorporate supervised, high-quality boundary information to enhance BABERT's learning, developing a semi-supervised boundary-aware PLM. To assess PLMs' ability to encode boundaries, we introduce a novel "Boundary Information Metric" that is both simple and effective. This metric allows different PLMs to be compared without task-specific fine-tuning. Experimental results demonstrate that the improved BABERT variant outperforms the vanilla version not only on Chinese sequence labeling datasets but also across a broader range of Chinese natural language understanding tasks. Additionally, our proposed metric offers a convenient and accurate means of evaluating PLMs' boundary awareness.
