Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement

Catherine Arnett, Pamela D. Rivière, Tyler A. Chang, Sean Trott

Abstract

The relationship between language model tokenization and performance is an open area of research. Here, we investigate how different tokenization schemes impact number agreement in Spanish plurals. We find that morphologically-aligned tokenization performs similarly to other tokenization schemes, even when induced artificially for words that would not be tokenized that way during training. We then present exploratory analyses demonstrating that language model embeddings for different plural tokenizations have similar distributions along the embedding space axis that maximally distinguishes singular and plural nouns. Our results suggest that morphologically-aligned tokenization is a viable tokenization approach, and existing models already generalize some morphological patterns to new items. However, our results indicate that morphological tokenization is not strictly required for performance.

Paper Structure (20 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Log-odds varied significantly as a function of noun number (singular vs. plural). The extent of this variance interacted (weakly) with initial tokenization (morphemic vs. non-morphemic vs. single-token) and with whether the original or artificial tokenization procedure was used. Larger log-odds indicate higher probabilities of the plural article.
  • Figure 2: LDA for singular and plural embeddings reveals axes of overlap (left) and discriminability (right) for differentially tokenized plural forms.
  • Figure 3: Single-token plurals were significantly more frequent than those tokenized according to morphemic boundaries, which were more frequent than those tokenized according to non-morphemic substrings.
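
The log-odds measure in Figure 1 can be illustrated with a minimal sketch. It assumes the log-odds are the natural log of the ratio between the probability a language model assigns to a plural article and the probability it assigns to the matching singular article; the exact definition is not given in this excerpt. The function name `plural_log_odds` and the probabilities below are hypothetical, chosen only for illustration.

```python
import math


def plural_log_odds(p_plural: float, p_singular: float) -> float:
    """Log-odds of the plural article relative to the singular article.

    Assumed definition (not confirmed by the source): log(p_plural / p_singular).
    Positive values mean the plural article is the more probable of the two.
    """
    if p_plural <= 0 or p_singular <= 0:
        raise ValueError("probabilities must be strictly positive")
    return math.log(p_plural / p_singular)


# Hypothetical probabilities for the article slot before a plural noun,
# e.g. candidates "las" (plural) and "la" (singular) before "casas".
before_plural_noun = plural_log_odds(p_plural=0.90, p_singular=0.05)

# Hypothetical probabilities for the same slot before a singular noun ("casa").
before_singular_noun = plural_log_odds(p_plural=0.05, p_singular=0.90)

print(before_plural_noun > 0)    # plural article favored
print(before_singular_noun < 0)  # singular article favored
```

Under this assumed definition, a model that tracks number agreement yields positive log-odds before plural nouns and negative log-odds before singular nouns, which matches the caption's statement that larger log-odds indicate higher probabilities of the plural article.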