TookaBERT: A Step Forward for Persian NLU
MohammadAli SadraeiJavaheri, Ali Moghaddaszadeh, Milad Molazadeh, Fariba Naeiji, Farnaz Aghababaloo, Hamideh Rafiee, Zahra Amirmahani, Tohid Abedini, Fatemeh Zahra Sheikhi, Amirmohammad Salehoof
TL;DR
This work addresses the lack of large-scale Persian BERT models by introducing TookaBERT-Base and TookaBERT-Large, trained with a 48k BPE tokenizer and MLM-only pre-training on a large Persian dataset. Training relies on flash attention, ZeRO-2, and whole-word masking. Evaluated on 14 Persian NLU tasks against seven existing Persian and multilingual models, TookaBERT-Large achieves an average improvement of at least +2.8 points over the baselines. The evaluation protocol searches over multiple learning rates during fine-tuning, which makes the reported gains for the larger model less dependent on a single hyperparameter choice. Both checkpoints are publicly available, with the aim of accelerating Persian NLP research and broadening the applicability of foundation models to low-resource languages.
Abstract
The field of natural language processing (NLP) has seen remarkable advancements thanks to the power of deep learning and foundation models. Language models, and specifically BERT, have been key players in this progress. In this study, we train and introduce two new BERT models using Persian data. We put our models to the test, comparing them to seven existing models across 14 diverse Persian natural language understanding (NLU) tasks. The results speak for themselves: our larger model outperforms the competition, showing an average improvement of at least +2.8 points. This highlights the effectiveness and potential of our new BERT models for Persian NLU tasks.
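Whole-word masking, one of the training techniques named in the TL;DR, can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `whole_word_mask`, the `##` continuation marker (a WordPiece-style convention, whereas TookaBERT uses a BPE tokenizer whose markers may differ), and the 15% masking rate are all assumptions made for illustration. The sketch also simplifies standard MLM by always substituting the mask token rather than sometimes keeping or randomizing the selected tokens.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Mask subword tokens so that a word is either fully masked or untouched.

    Assumption: continuation pieces start with "##" (hypothetical convention
    chosen for this sketch, not necessarily the format of the real tokenizer).
    Returns (masked_tokens, labels), where labels[i] holds the original token
    at masked positions and None elsewhere.
    """
    rng = random.Random(seed)

    # Group token indices into words: a "##" piece joins the preceding word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    labels = [None] * len(tokens)

    # Draw once per word, then mask every piece of the selected words.
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                labels[i] = tokens[i]
                masked[i] = mask_token

    return masked, labels


if __name__ == "__main__":
    example = ["play", "##ing", "foot", "##ball", "is", "fun"]
    print(whole_word_mask(example, mask_prob=0.5, seed=0))
```

The key design point is that the random draw happens once per word rather than once per token, so the model can never recover a masked piece from a visible sibling piece of the same word.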
