Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics

Anas Belfathi, Ygor Gallina, Nicolas Hernandez, Richard Dufour, Laura Monceaux

TL;DR

The paper addresses adapting masked language models to specialized domains by moving beyond random masking to selective masking informed by genre and topical cues. It introduces two word-importance scores, MetaDiscourse for genre-specificity and TF-IDF for topical salience, and a ranking-guided masking strategy evaluated through continual pre-training on a legal corpus. Empirical results on LexGLUE show consistent gains over baselines, with task-dependent advantages for MetaDiscourse and TF-IDF and varying benefits from Rand versus TopN masking. The work highlights practical benefits for domain-focused NLP applications and provides open-source resources for reproducibility and broader domain exploration.

Abstract

Recent advances in pre-trained language modeling have facilitated significant progress across various natural language processing (NLP) tasks. Word masking during model training constitutes a pivotal component of language modeling in architectures like BERT. However, the prevalent method of word masking relies on random selection, potentially disregarding domain-specific linguistic attributes. In this article, we introduce an innovative masking approach leveraging genre and topicality information to tailor language models to specialized domains. Our method incorporates a ranking process that prioritizes words based on their significance, subsequently guiding the masking procedure. Experiments conducted using continual pre-training within the legal domain have underscored the efficacy of our approach on the LegalGLUE benchmark in the English language. Pre-trained language models and code are freely available for use.
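
The ranking-guided masking described above can be illustrated with a minimal sketch. It is not the authors' released code: the function names, the plain TF-IDF formula, the 15% default masking budget, and the interpretation of the "topn" and "rand" strategies (deterministic top-ranked words versus score-weighted random sampling) are all assumptions made for illustration. Documents are assumed to be non-empty lists of word tokens.

```python
import math
import random
from collections import Counter

MASK = "[MASK]"  # assumed mask symbol, as used by BERT-style tokenizers


def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per tokenized (non-empty) document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(
            {w: (c / len(doc)) * math.log(n_docs / df[w]) for w, c in tf.items()}
        )
    return scores


def select_positions(tokens, scores, ratio=0.15, strategy="topn", rng=None):
    """Pick token positions to mask, guided by per-word importance scores."""
    if not tokens:
        return []
    budget = min(len(tokens), max(1, round(ratio * len(tokens))))
    if strategy == "topn":
        # Deterministic: mask the highest-scoring words (ties keep text order).
        ranked = sorted(
            range(len(tokens)), key=lambda i: scores[tokens[i]], reverse=True
        )
        return sorted(ranked[:budget])
    # Assumed "rand" variant: sample without replacement, with probability
    # proportional to the score (a small epsilon keeps zero scores selectable).
    rng = rng or random.Random()
    pool = list(range(len(tokens)))
    weights = [scores[tokens[i]] + 1e-6 for i in pool]
    chosen = []
    for _ in range(budget):
        pick = rng.choices(range(len(pool)), weights=weights)[0]
        chosen.append(pool.pop(pick))
        weights.pop(pick)
    return sorted(chosen)


def apply_mask(tokens, positions):
    """Replace the selected positions with the mask symbol."""
    selected = set(positions)
    return [MASK if i in selected else tok for i, tok in enumerate(tokens)]


docs = [
    ["the", "court", "dismissed", "the", "appeal"],
    ["the", "tenant", "paid", "the", "rent"],
]
scores = tfidf_scores(docs)
positions = select_positions(docs[0], scores[0], ratio=0.4, strategy="topn")
print(apply_mask(docs[0], positions))
```

In this toy corpus the word "the" occurs in every document, so its TF-IDF score is zero and the top-ranked strategy never masks it, whereas uniform random masking would spend part of the masking budget on it. A genre-oriented score such as MetaDiscourse could be plugged in by replacing `tfidf_scores` with any function that returns the same per-document word-to-score mapping.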

Paper Structure (17 sections, 1 equation, 3 tables, 1 algorithm)