RoBERTurk: Adjusting RoBERTa for Turkish

Nuri Tas

TL;DR

This work investigates adapting RoBERTa to Turkish morphology by pretraining RoBERTurk with a SentencePiece BPE tokenizer on roughly 28GB of Turkish data. The approach follows RoBERTa-style pretraining with dynamic masking and no next sentence prediction, implemented via FAIRSEQ and mixed precision, totaling about 600k pretraining steps. Empirical evaluation on POS tagging (BOUN, IMST) and NER (XTREME Turkish) shows RoBERTurk outperforms BERTurk on BOUN, underperforms on IMST, and achieves competitive XTREME NER scores, highlighting the impact of dataset and task on morphologically rich language modeling. The authors release the pretrained model and tokenizer to facilitate further Turkish NLP research and applications.
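
To make the dynamic-masking idea concrete, the following is a minimal sketch in plain Python and not the FAIRSEQ implementation used for RoBERTurk. It resamples the mask pattern every time a sequence is drawn, rather than fixing it once during preprocessing. The 15% masking rate and the 80/10/10 replacement split are the commonly cited BERT/RoBERTa defaults and are assumed here, since the summary does not state them; the token ids, vocabulary size, and the `dynamic_mask` helper are hypothetical.

```python
import random

# All constants below are hypothetical placeholders, not values from the paper.
MASK_ID = 4            # assumed id of the <mask> token
FIRST_REGULAR_ID = 5   # assumed first non-special token id
VOCAB_SIZE = 32000     # assumed vocabulary size
SPECIAL_IDS = {0, 1, 2, 3, MASK_ID}  # assumed special tokens, never masked
IGNORE = -100          # label for positions excluded from the loss


def dynamic_mask(token_ids, rng, mask_prob=0.15):
    """Return (inputs, labels) with a freshly sampled mask pattern.

    Calling this again on the same sequence draws a new pattern,
    which is what distinguishes dynamic from static masking.
    """
    inputs = list(token_ids)
    labels = [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if tok in SPECIAL_IDS or rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID  # 80%: replace with <mask>
        elif roll < 0.9:
            # 10%: replace with a random non-special token
            inputs[i] = rng.randrange(FIRST_REGULAR_ID, VOCAB_SIZE)
        # remaining 10%: keep the original token in the input
    return inputs, labels


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = [0] + list(range(10, 40)) + [2]  # toy ids with assumed BOS/EOS
    for epoch in range(2):
        inputs, labels = dynamic_mask(sentence, rng)
        print(epoch, inputs)
```

Because the pattern is drawn at load time, the same sentence is typically seen with different masked positions in different epochs, which matters when pretraining runs for many steps over a fixed corpus.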

Abstract

We pretrain RoBERTa on a Turkish corpus using a BPE tokenizer. Our model outperforms the BERTurk family of models on the BOUN dataset for POS tagging, underperforms them on the IMST dataset for the same task, and achieves competitive scores on the Turkish split of the XTREME dataset for NER, all while being pretrained on less data than its competitors. We release our pretrained model and tokenizer.

Paper Structure (8 sections, 4 tables)

This paper contains 8 sections, 4 tables.