PhayaThaiBERT: Enhancing a Pretrained Thai Language Model with Unassimilated Loanwords

Panyut Sriwirote, Jalinee Thapiang, Vasan Timtong, Attapol T. Rutherford

TL;DR

PhayaThaiBERT addresses Thai language modeling gaps related to unassimilated loanwords by expanding WangchanBERTa’s vocabulary through cross-model transfer from XLM-R and by pretraining on a substantially larger Thai corpus. The approach combines a RoBERTa-based architecture with dual embeddings for new vocabulary, starting from WangchanBERTa’s weights and using a carefully designed training procedure to stabilize learning. Empirical results show PhayaThaiBERT generally improves over WangchanBERTa on multiple downstream tasks and reduces OOV rates, though the model becomes significantly larger and more compute-intensive. The findings highlight the value of explicit foreign word vocabulary expansion for language models in highly code-switched contexts and provide a publicly available Thai language model for broader use.

Abstract

While WangchanBERTa has become the de facto standard in transformer-based Thai language modeling, it still falls short in understanding foreign words, most notably English words, which are often borrowed into Thai without orthographic assimilation. We identify the lack of foreign vocabulary in WangchanBERTa's tokenizer as the main source of these shortcomings. We then expand WangchanBERTa's vocabulary via vocabulary transfer from XLM-R's pretrained tokenizer and pretrain a new model with the expanded tokenizer, starting from WangchanBERTa's checkpoint, on a new dataset that is larger than the one used to train WangchanBERTa. Our results show that the new pretrained model, PhayaThaiBERT, outperforms WangchanBERTa on many downstream tasks and datasets.
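
To make the idea of vocabulary expansion concrete, the toy sketch below adds tokens from a donor vocabulary to a base vocabulary and compares the share of unknown tokens before and after. It is only an illustration under simplifying assumptions, not the authors' procedure: the vocabularies, the sample sentence, and the helper names `expand_vocab` and `unk_rate` are hypothetical, tokens are treated as whole words rather than subword pieces, and embedding transfer is left out entirely.

```python
def expand_vocab(base_vocab, donor_tokens):
    """Return a copy of base_vocab with unseen donor tokens appended.

    Assumes base_vocab ids are contiguous (0..n-1), so len() gives the next free id.
    """
    expanded = dict(base_vocab)
    for token in donor_tokens:
        if token not in expanded:
            expanded[token] = len(expanded)
    return expanded


def unk_rate(tokens, vocab):
    """Fraction of tokens that are missing from vocab."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocab)
    return missing / len(tokens)


# Hypothetical toy data: a Thai-only base vocabulary and a donor list
# that also covers unassimilated English loanwords.
base_vocab = {"<unk>": 0, "ฉัน": 1, "ชอบ": 2}
donor_tokens = ["ฉัน", "pizza", "online"]
sample = ["ฉัน", "ชอบ", "pizza", "online"]

expanded_vocab = expand_vocab(base_vocab, donor_tokens)
rate_before = unk_rate(sample, base_vocab)
rate_after = unk_rate(sample, expanded_vocab)
```

In this toy setting, the two English words in the sample are unknown to the base vocabulary and become known once the donor tokens are added, which mirrors in miniature the reduction in unknown foreign tokens that motivates the expanded tokenizer.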
