Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew

Giuseppe Samo; Paola Merlo

TL;DR

This study investigates how transformer models represent morphologically rich verbal paradigms in Turkish and Hebrew, focusing on how tokenization shapes learning. Using the Blackbird Language Matrices (BLM) task with natural and synthetic data, it shows that Turkish morphosyntax is robust to tokenization granularity: both atomic and subword representations capture paradigm relations. Hebrew’s templatic binyanim, by contrast, require tokenizations that preserve morpho-phonological structure, with monolingual models outperforming multilingual ones. The work introduces a paradigm-level evaluation to reveal how tokenization acts as a linguistic filter, and shows that performance improves on more synthetic data across models. Overall, the findings underscore the need for linguistically informed tokenization choices when modeling languages with diverse morphological typologies, and demonstrate the utility of BLM tasks for diagnosing such interactions in internal representations.

Abstract

We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish -- with its transparent morphological markers -- both monolingual and multilingual models succeed, whether tokenization is atomic or breaks words into small subword units. For Hebrew, by contrast, monolingual and multilingual models diverge. A multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, but a monolingual model with morpheme-aware segmentation performs well. Across all models, performance improves on more synthetic datasets.

Paper Structure (16 sections, 7 figures, 2 tables)

Introduction
The task
Data and Models
BLM template
Instantiation
Models
Tokenisation
Experiments
Materials & Methods
Data
System
Results
Analyzing the verbal paradigm
Discussion
Related Work
...and 1 more section

Figures (7)

Figure 1: Verbal paradigm voices under investigation, with examples for the Turkish verb yaz- and the Hebrew root KTB (both related to the act of writing). Hebrew binyanim are adapted from Kastner (2019); the names of the binyanim are given in brackets.
Figure 2: BLM template and its instantiation in Turkish and Hebrew. The verb under investigation is underlined in the English translation. The indicated voice label is used only for error analysis, not for training. The sentence IDs refer to the dataset from which the natural data are extracted, as discussed in the Data section.
Figure 3: Number of tokens per voice form across models and languages.
Figure 4: F1 for each voice as a correct answer across models. The dark violet dotted line in the upper panel indicates chance level.
Figure 5: Confusion matrices of raw counts (test set n = 200)
...and 2 more figures
