MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer

TL;DR

This work introduces MYTE, an encoding paradigm that represents the same information with segments of consistent size across diverse languages. MYTE is based on morphemes, whose inventories are more balanced across languages than those of characters, the unit used by previous methods.

Abstract

A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically meaningless units. To address the disparities, we introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages. Our encoding convention (MYTE) is based on morphemes, as their inventories are more balanced across languages than characters, which are used in previous methods. We show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.

Paper Structure (35 sections, 5 equations, 12 figures, 8 tables)

Figures (12)

  • Figure 1: The same phrase spelled in three languages: English, Czech, and Telugu. The UTF-8 byte encoding of the phrase is shown in blue, with the MYTE encoding in green underneath. MYTE achieves higher encoding compression, especially for texts using diacritics or non-Latin scripts.
  • Figure 2: UTF-8 codepage (inspired by the visualizations from: en.wikipedia.org/wiki/UTF-8). Each row contains 16 bytes with the same leading hexadecimal digit. Bytes in the range C2-F4 are leading bytes. They mark the beginning of a multibyte code of the length shown in each cell. Bytes in the range 80-BF are continuation bytes, which follow a leading byte in multibyte codes. Bytes C0 and C1 are unused. Range 41-5A encodes Latin capital letters. In MYTE, these characters are decomposed to free space used to encode morphemes.
  • Figure 3: Boxplot aggregating parity against English for four segmentation methods: MYTE, UTF-8, characters, and subword tokens from the mT5 tokenizer (Xue et al., 2021). Parities were computed on the multi-parallel Flores 200 corpus.
  • Figure 4: Average byte sequence lengths of parallel sentences from Flores 200 encoded by a) UTF-8 and b) MYTE. Panel c) depicts the percentage by which the latter sequences are shorter than the former. Results for all the languages can be found in the appendix.
  • Figure 5: The difference in Bit-per-English-Byte (BPEB) and inference time between MyT5 and ByT5 large models against the compression factor of MYTE. For each sentence, the BPEB value is normalized by the number of UTF-8 bytes used to represent the corresponding English sentence. Inference was run on an A40 GPU; we report average per-sentence deltas. $\rho_S$ are Spearman's correlation coefficients.
  • ...and 7 more figures
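
The byte roles described in the Figure 2 caption and the length parity used in Figure 3 can be illustrated with a short sketch. It covers plain UTF-8 only and is not the MYTE encoder. The function names `classify_utf8_byte`, `utf8_length`, and `parity` are hypothetical and chosen for illustration. The byte ranges follow the UTF-8 standard, in which bytes F5-FF are also never used. Parity is assumed here to be the ratio between the encoded length of a sentence and the encoded length of its English translation.

```python
def classify_utf8_byte(b: int) -> str:
    """Return the role a byte value plays in well-formed UTF-8."""
    if not 0 <= b <= 0xFF:
        raise ValueError("byte value must be in 0..255")
    if b <= 0x7F:
        return "ascii"         # single-byte code, e.g. Latin capitals 0x41..0x5A
    if b <= 0xBF:
        return "continuation"  # 0x80..0xBF, follows a leading byte
    if b in (0xC0, 0xC1) or b >= 0xF5:
        return "unused"        # never appears in valid UTF-8
    return "leading"           # 0xC2..0xF4, starts a multibyte code


def utf8_length(text: str) -> int:
    """Number of bytes needed to encode the text in UTF-8."""
    return len(text.encode("utf-8"))


def parity(text: str, english: str) -> float:
    """Assumed definition: encoded length relative to the English translation."""
    return utf8_length(text) / utf8_length(english)


if __name__ == "__main__":
    # Illustrative phrase pair, not taken from the paper or from Flores 200.
    english = "Good morning"
    czech = "Dobr\u00e9 r\u00e1no"
    print([classify_utf8_byte(b) for b in "\u00e9".encode("utf-8")])
    print(utf8_length(english), utf8_length(czech), parity(czech, english))
```

A parity above 1 means the sentence needs more bytes than its English counterpart, which is the disparity that a morphology-driven encoding aims to reduce.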