Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics
Anas Belfathi, Ygor Gallina, Nicolas Hernandez, Richard Dufour, Laura Monceaux
TL;DR
The paper addresses adapting masked language models to specialized domains by moving beyond random masking to selective masking informed by genre and topical cues. It introduces two word-importance scores, MetaDiscourse for genre-specificity and TF-IDF for topical salience, and a ranking-guided masking strategy evaluated through continual pre-training on a legal corpus. Empirical results on LexGLUE show consistent gains over baselines, with task-dependent advantages for MetaDiscourse and TF-IDF and varying benefits from Rand versus TopN masking. The work highlights practical benefits for domain-focused NLP applications and provides open-source resources for reproducibility and broader domain exploration.
Abstract
Recent advances in pre-trained language modeling have driven significant progress across a wide range of natural language processing (NLP) tasks. Word masking during training is a central component of language modeling in architectures such as BERT. However, the prevailing approach selects the words to mask at random, which may overlook domain-specific linguistic attributes. In this article, we introduce a masking approach that uses genre and topicality information to adapt language models to specialized domains. Our method ranks words by their significance, and this ranking then guides the masking procedure. Experiments with continual pre-training in the legal domain demonstrate the effectiveness of our approach on the English-language LexGLUE benchmark. The pre-trained language models and code are freely available.
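To make the idea of ranking-guided masking concrete, the following is a minimal Python sketch rather than the released implementation. It scores each token with a plain TF-IDF weight, then selects the positions to mask either deterministically from the top of the ranking ("topn") or by score-weighted sampling ("rand"). The function names, the unsmoothed IDF formula, the masking ratio, and in particular the interpretation of the "rand" strategy are assumptions made for illustration. The MetaDiscourse score is not shown.

```python
import math
import random
from collections import Counter


def idf_table(docs):
    """Unsmoothed inverse document frequency, log(N / df), for a tokenized corpus."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tfidf_scores(tokens, idf):
    """Score every token position by (term frequency in the sequence) * IDF.

    Words absent from the IDF table get a score of 0 (a simplification).
    """
    tf = Counter(tokens)
    return [tf[t] / len(tokens) * idf.get(t, 0.0) for t in tokens]


def select_positions(scores, ratio=0.15, strategy="topn", rng=None):
    """Pick the positions to mask, guided by the importance scores.

    "topn": take the k highest-scoring positions (ties keep sequence order).
    "rand": ASSUMED variant, sample k positions without replacement with
            probability proportional to the score.
    """
    k = min(len(scores), max(1, round(len(scores) * ratio)))
    if strategy == "topn":
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return sorted(ranked[:k])
    rng = rng or random.Random(0)
    pool = list(range(len(scores)))
    chosen = []
    for _ in range(k):
        # A tiny epsilon keeps the weights valid when every score is zero.
        weights = [scores[i] + 1e-9 for i in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return sorted(chosen)


def apply_mask(tokens, positions, mask_token="[MASK]"):
    """Replace the selected positions with the mask token."""
    selected = set(positions)
    return [mask_token if i in selected else t for i, t in enumerate(tokens)]


# Toy corpus (invented for this example, not taken from the paper's data).
docs = [
    ["the", "court", "held", "that", "the", "appeal", "was", "dismissed"],
    ["the", "tenant", "paid", "the", "rent"],
    ["the", "court", "granted", "the", "injunction"],
]
idf = idf_table(docs)
tokens = docs[2]
scores = tfidf_scores(tokens, idf)
masked = apply_mask(tokens, select_positions(scores, ratio=0.4, strategy="topn"))
print(masked)  # ['the', 'court', '[MASK]', 'the', '[MASK]']
```

In this toy run, "the" occurs in every document and therefore receives a score of zero, so the mask falls on the rarer, more topical words. A real pipeline would operate on subword tokens and plug the selected positions into the masked language modeling objective during continual pre-training.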
