
Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization

Zilong Li

TL;DR

The prediction results suggest that morphological segmentation can be as effective as commonly used subword tokenizers. The submission also investigates how a tokenizer's vocabulary influences the performance of language models.

Abstract

This paper presents the submission of team Ryu to the canceled SIGMORPHON 2024 shared task on subword tokenization. My submission explores whether morphological segmentation methods can be used as part of subword tokenizers. I build tokenizers on two approaches: the statistical segmentation method Morfessor and a transformer-based sequence-to-sequence (seq2seq) segmentation model. The prediction results show that morphological segmentation could be as effective as commonly used subword tokenizers. Additionally, I investigate how a tokenizer's vocabulary influences the performance of language models. A tokenizer with a balanced token frequency distribution tends to work better. A balanced token vocabulary can be achieved by keeping frequent words as unique tokens.
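
The idea of keeping frequent words as unique tokens, and of checking how balanced the resulting token frequencies are, can be illustrated with a minimal Python sketch. It is not the submission's implementation. The function names, the toy suffix splitter standing in for a morphological segmenter, and the use of normalized entropy as a balance measure are all assumptions made for illustration. The abstract does not state how balance was quantified.

```python
import math
from collections import Counter


def build_whole_word_vocab(corpus_words, top_k):
    """Return the top_k most frequent words, to be kept as single tokens."""
    return {w for w, _ in Counter(corpus_words).most_common(top_k)}


def toy_segment(word):
    """Hypothetical stand-in for a morphological segmenter (e.g. Morfessor
    or a seq2seq model): it only strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return [word[: -len(suffix)], "##" + suffix]
    return [word]


def tokenize(word, whole_words, segment=toy_segment):
    """Keep frequent words intact; pass everything else to the segmenter."""
    return [word] if word in whole_words else segment(word)


def normalized_entropy(tokens):
    """Entropy of the token frequency distribution divided by its maximum
    (log of the number of distinct tokens). 1.0 means perfectly uniform.
    Assumed balance measure, not taken from the paper."""
    counts = Counter(tokens)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(len(counts))


if __name__ == "__main__":
    corpus = "the cat walked and the dogs walked while the cat was sleeping".split()
    whole = build_whole_word_vocab(corpus, top_k=2)
    tokens = [t for w in corpus for t in tokenize(w, whole)]
    print(tokens)
    print(round(normalized_entropy(tokens), 3))
```

Comparing `normalized_entropy` for different values of `top_k` gives a rough way to see how the whole-word threshold shifts the token frequency distribution on a given corpus.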

Paper Structure

This paper contains 23 sections, 2 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Token frequency distribution of tokenizers