LegalTurk Optimized BERT for Multi-Label Text Classification and NER
Farnaz Zeidi, Mehmet Fatih Amasyali, Çiğdem Erol
TL;DR
This study targets the Turkish legal domain, aiming to improve BERT-based text classification and NER by rethinking pre-training rather than changing the model architecture. It systematically compares NSP, SOP, and MLM masking strategies, including TF-IDF-guided token replacements, while training from scratch on a domain-specific corpus. Across two downstream tasks (NER and multi-label classification), the authors show that removing NSP and refining MLM (especially with TF-IDF masking) yields strong gains, particularly when pre-training uses 2 GB of legal Turkish data. Despite a relatively small corpus, the domain-adapted models perform competitively with, or better than, the general-domain Turkish BERTurk, underscoring the value of task- and domain-focused pre-training for non-English legal NLP. The work also outlines practical paths for extending coverage, including larger corpora, additional tasks, and multilingual experiments, to further improve generalization and usefulness in legal contexts.
Abstract
The introduction of the Transformer neural network, along with techniques like self-supervised pre-training and transfer learning, has paved the way for advanced models like BERT. Despite BERT's impressive performance, opportunities for further enhancement exist. To our knowledge, most efforts have focused on improving BERT's performance in English and in general domains, with no study specifically addressing the legal Turkish domain. Our study is primarily dedicated to enhancing the BERT model within the legal Turkish domain through modifications in the pre-training phase. In this work, we introduce a modified pre-training approach that combines diverse masking strategies. For fine-tuning, we focus on two essential downstream tasks in the legal domain: named entity recognition (NER) and multi-label text classification. To evaluate our modified pre-training approach, we fine-tuned all customized models alongside the original BERT models and compared their performance. Our modified approach demonstrated significant improvements in both NER and multi-label text classification compared to the original BERT model. Finally, to showcase the impact of our proposed models, we trained our best models with different corpus sizes and compared them with BERTurk models. The experimental results demonstrate that our approach, despite being pre-trained on a smaller corpus, competes with BERTurk.
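To make the idea of TF-IDF-guided masking more concrete, the following is a minimal sketch of one possible reading: instead of picking MLM positions uniformly at random, the tokens with the highest TF-IDF score in each document are masked. This is an illustrative assumption, not the authors' procedure, which may differ (for example, in how TF-IDF interacts with token replacement). The function names `tfidf_scores` and `mask_by_tfidf`, the smoothed IDF formula, and the 15% mask ratio are all choices made here for illustration.

```python
import math
from collections import Counter

MASK = "[MASK]"

def tfidf_scores(docs):
    """Return one {token: tf-idf} dict per tokenized document.

    Assumption: a smoothed IDF, log((1 + N) / (1 + df)) + 1, is used.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            tok: (count / len(doc))
                 * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
            for tok, count in tf.items()
        })
    return scores

def mask_by_tfidf(doc, doc_scores, mask_ratio=0.15):
    """Mask the highest-scoring tokens (hypothetical selection rule).

    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None elsewhere.
    """
    n_mask = max(1, round(len(doc) * mask_ratio))
    # Rank positions by descending score; ties go to the earlier position.
    ranked = sorted(range(len(doc)), key=lambda i: (-doc_scores[doc[i]], i))
    chosen = set(ranked[:n_mask])
    masked = [MASK if i in chosen else tok for i, tok in enumerate(doc)]
    labels = [tok if i in chosen else None for i, tok in enumerate(doc)]
    return masked, labels
```

Under this rule, frequent function words such as "ve" receive low scores and are rarely masked, so the prediction task concentrates on document-specific legal terms. A real pipeline would operate on subword IDs from the model's tokenizer rather than on whole-word strings.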
