Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement

Catherine Arnett; Pamela D. Rivière; Tyler A. Chang; Sean Trott

Abstract

The relationship between language model tokenization and performance is an open area of research. Here, we investigate how different tokenization schemes impact number agreement in Spanish plurals. We find that morphologically-aligned tokenization performs similarly to other tokenization schemes, even when induced artificially for words that would not be tokenized that way during training. We then present exploratory analyses demonstrating that language model embeddings for different plural tokenizations have similar distributions along the embedding space axis that maximally distinguishes singular and plural nouns. Our results suggest that morphologically-aligned tokenization is a viable tokenization approach, and existing models already generalize some morphological patterns to new items. However, our results indicate that morphological tokenization is not strictly required for performance.

Paper Structure (20 sections, 3 figures, 2 tables)

Introduction
Related Work
Model and Data
Data
Identifying Tokenization Type
Relationship of Tokenization to Frequency
Artificial Tokenization Procedure
Study: Article-Noun Agreement
Method
Results
Impact of Initial Tokenization
Success of Artificial Tokenization
Comparing Default vs. Artificial Tokenization Schemes
Linear Discriminant Analysis (LDA)
Discussion and Conclusion
...and 5 more sections

Figures (3)

Figure 1: Log-odds varied significantly as a function of noun number (singular vs. plural). The extent of this variance interacted (weakly) with initial tokenization (morphemic vs. non-morphemic vs. single-token) and with whether the original or artificial tokenization procedure was used. Larger log-odds indicate higher probabilities of the plural article.
Figure 2: LDA for singular and plural embeddings reveals axes of overlap (left) and discriminability (right) for differentially tokenized plural forms.
Figure 3: Single-token plurals were significantly more frequent than those tokenized according to morphemic boundaries, which were more frequent than those tokenized according to non-morphemic substrings.
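The log-odds measure named in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the log-odds is the log of the plural article's probability divided by the singular article's probability, which matches the caption's statement that larger values indicate higher probabilities of the plural article; the exact formulation used in the paper may differ. The function name and the probabilities below are hypothetical and chosen only for illustration.

```python
import math


def plural_log_odds(p_plural: float, p_singular: float) -> float:
    """Log-odds of the plural article relative to the singular article.

    Assumed definition (not confirmed by the source): log(p_plural / p_singular).
    Positive values mean the plural article is the more probable of the two.
    """
    if p_plural <= 0 or p_singular <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(p_plural) - math.log(p_singular)


# Hypothetical probabilities, made up for illustration only.
after_plural_noun = plural_log_odds(0.8, 0.2)    # plural article favoured
after_singular_noun = plural_log_odds(0.1, 0.9)  # singular article favoured
```

Under this definition, a model that tracks number agreement should give positive log-odds for plural nouns and negative log-odds for singular nouns, whichever way the plural noun is tokenized.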
