Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew
Giuseppe Samo, Paola Merlo
TL;DR
This study investigates how transformer models represent morphologically rich verbal paradigms in Turkish and Hebrew, focusing on how tokenization shapes learning. Using the Blackbird Language Matrices (BLM) task with natural and synthetic data, it shows that Turkish morphosyntax is robust to tokenization granularity (both atomic and subword representations capture paradigm relations), whereas Hebrew’s templatic binyanim require tokenizations that preserve morpho-phonological structure, with the monolingual model outperforming the multilingual one. The work introduces a paradigm-level evaluation that reveals how tokenization acts as a linguistic filter, and shows that all models perform better on more synthetic data. Overall, the findings underscore the need for linguistically informed tokenization choices when modeling languages with diverse morphological typologies, and demonstrate the utility of BLM tasks for diagnosing how tokenization and morphology interact in internal representations.
Abstract
We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish -- with its transparent morphological markers -- both monolingual and multilingual models succeed, whether tokenization is atomic or breaks words into small subword units. For Hebrew, by contrast, monolingual and multilingual models diverge: a multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, whereas a monolingual model with morpheme-aware segmentation performs well. In all models, performance improves on more synthetic datasets.
