New Language Models for Serbian (Novi jezički modeli za srpski jezik)

Mihailo Škorić

TL;DR

This work addresses the need for robust Serbian language transformers by surveying published models and conducting a systematic benchmark of ten Serbian text-vectorization models across four NLP tasks. It leverages resources from the Society for Language Resources and Technologies and introduces two new models, jerteh-81 and jerteh-355, to evaluate how architecture, parameter count, and data scale affect performance. The study finds that RoBERTa-based models benefit from large Serbian data on upstream tasks, while XLM-RoBERTa-based models perform comparatively well downstream, with the largest gains seen for models trained on substantial, high-quality Serbian corpora. The results offer practical guidance for training Serbian language models and underscore the importance of corpus quality and task-specific fine-tuning for effective text vectorization.

Abstract

The paper will briefly present the development history of transformer-based language models for Serbian. Several new models for text generation and vectorization, trained on the resources of the Society for Language Resources and Technologies, will also be presented. Ten selected vectorization models for Serbian, including two new ones, will be compared on four natural language processing tasks. The paper will analyze which models perform best on each selected task, how their size and the size of their training sets affect performance on those tasks, and what the optimal setup is for training the best language models for Serbian.

Paper Structure

This paper contains 13 sections, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Accuracy of the models on the masked language modeling task as a function of model size (left) and of training set size (right). The trend curve shown corresponds to a logarithmic function.
  • Figure 2: Accuracy of the models on the masked language modeling task as a function of model size (left) and of training set size (right), with the results of models based on the XLM-R architecture removed. The trend curve shown corresponds to a logarithmic function.
  • Figure 3: Accuracy of the models on the embedding task as a function of model size (left) and of training set size (right). The trend curve shown corresponds to a logarithmic function.
  • Figure 4: Performance of the models on the part-of-speech tagging task as a function of model size (left) and of training set size (right). The trend curve shown corresponds to a logarithmic function.
  • Figure 5: Performance of the models on the named entity recognition task as a function of model size (left) and of training set size (right). The trend curve shown corresponds to a logarithmic function.
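
Each figure caption mentions a logarithmic trend curve. As a rough illustration of what such a curve is, the sketch below fits y = a + b·ln(x) by ordinary least squares on ln(x). This is only one plausible way to obtain such a trend line; the paper's actual fitting procedure is not described in this summary. The function name `fit_log_trend` and the input points are hypothetical, and the points are synthetic values generated from a known curve, not results from the paper.

```python
import math


def fit_log_trend(xs, ys):
    """Least-squares fit of y = a + b * ln(x); returns (a, b).

    Hypothetical helper for illustration, not the paper's code.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    if any(x <= 0 for x in xs):
        raise ValueError("x values must be positive to take ln(x)")

    # Linearize: with t = ln(x), the model is the straight line y = a + b * t.
    ts = [math.log(x) for x in xs]
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n

    # Closed-form simple linear regression on (t, y).
    sxx = sum((t - mean_t) ** 2 for t in ts)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))

    b = sxy / sxx
    a = mean_y - b * mean_t
    return a, b


# Synthetic points generated from y = 0.2 + 0.1 * ln(x).
# These are NOT measurements from the paper; they only exercise the fit.
sizes = [10.0, 100.0, 1000.0, 10000.0]
scores = [0.2 + 0.1 * math.log(s) for s in sizes]

a, b = fit_log_trend(sizes, scores)
print(f"a = {a:.4f}, b = {b:.4f}")
```

Because the synthetic points lie exactly on the generating curve, the fit recovers its coefficients up to floating-point error. With real scores, the slope b would indicate how much the metric changes per unit increase in ln(x), where x is model size or training set size.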