BERTje: A Dutch BERT Model
Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, Malvina Nissim
TL;DR
The paper introduces BERTje, a monolingual Dutch BERT model trained on a diverse 12GB Dutch corpus and evaluated against multilingual BERT on Dutch NLP tasks. BERTje is pre-trained with a Sentence Order Prediction objective and a refined masked language modeling strategy, and it consistently outperforms multilingual BERT on NER, POS tagging, SRL/STR, and sentiment analysis. The results suggest that monolingual pre-training on multi-genre data yields measurable gains over multilingual pre-training, and that higher-level linguistic information benefits from extended training beyond 850k iterations. These findings support monolingual pre-training for non-English languages and offer practical guidance on pre-training data composition and training dynamics.
Abstract
The transformer-based pre-trained language model BERT has helped to improve state-of-the-art performance on many natural language processing (NLP) tasks. Using the same architecture and parameters, we developed and evaluated a monolingual Dutch BERT model called BERTje. Compared to the multilingual BERT model, which includes Dutch but is only based on Wikipedia text, BERTje is based on a large and diverse dataset of 2.4 billion tokens. BERTje consistently outperforms the equally-sized multilingual BERT model on downstream NLP tasks (part-of-speech tagging, named-entity recognition, semantic role labeling, and sentiment analysis). Our pre-trained Dutch BERT model is made available at https://github.com/wietsedv/bertje.
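The Sentence Order Prediction objective named in the TL;DR can be illustrated with a minimal sketch. This is not the authors' pre-training code: the function name `make_sop_examples`, the 50% swap probability, the label convention, and the toy sentences are all assumptions chosen for illustration. The idea is that the model sees two consecutive sentences, either in their original order or swapped, and must predict which.

```python
import random


def make_sop_examples(sentences, rng):
    """Build (first, second, label) triples from consecutive sentences.

    label 0: the pair is in its original order.
    label 1: the pair was swapped.
    The 50% swap rate is an assumption, not a value taken from the paper.
    """
    examples = []
    for first, second in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            examples.append((second, first, 1))  # swapped order
        else:
            examples.append((first, second, 0))  # original order
    return examples


# Hypothetical toy document, one sentence per list item.
doc = ["Het regent vandaag.", "Ik neem een paraplu mee.", "De bus is te laat."]
for example in make_sop_examples(doc, random.Random(0)):
    print(example)
```

Because both sentences always come from the same document, the classifier cannot rely on topic differences and has to attend to discourse order instead.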
