From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

Zébulon Goriely, Richard Diehl Martinez, Andrew Caines, Lisa Beinborn, Paula Buttery

TL;DR

This work develops a pipeline to convert text datasets into a continuous stream of phonemes, and shows that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.

Abstract

Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.
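As a rough illustration of what converting text into a continuous stream of phonemes can look like, here is a minimal sketch. It is not the authors' pipeline: the toy lexicon, the ARPAbet-style symbols, the `WORD_BOUNDARY` marker, and the function name `to_phoneme_stream` are all assumptions made for this example. Only the `UTT_BOUNDARY` token name comes from the paper's Figure 3 caption, and how it is placed here is also an assumption.

```python
import re

# Hypothetical toy lexicon with ARPAbet-style symbols (an assumption for
# illustration; a real pipeline would rely on a full grapheme-to-phoneme tool).
TOY_LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}

# Token name taken from the Figure 3 caption; placing it after every
# utterance is an assumption of this sketch.
UTT_BOUNDARY = "UTT_BOUNDARY"
# Hypothetical marker, used only when word boundaries are kept.
WORD_BOUNDARY = "WORD_BOUNDARY"


def to_phoneme_stream(utterances, lexicon=TOY_LEXICON, keep_word_boundaries=False):
    """Map orthographic utterances to one flat list of phoneme tokens."""
    stream = []
    for utterance in utterances:
        # Lowercase and drop punctuation, keeping only word-like spans.
        words = re.findall(r"[a-z']+", utterance.lower())
        for word in words:
            if word not in lexicon:
                raise KeyError(f"no pronunciation for {word!r} in the toy lexicon")
            stream.extend(lexicon[word])
            if keep_word_boundaries:
                stream.append(WORD_BOUNDARY)
        # Mark the end of the utterance.
        stream.append(UTT_BOUNDARY)
    return stream


if __name__ == "__main__":
    print(" ".join(to_phoneme_stream(["The cat sat."])))
```

With `keep_word_boundaries=False`, the output has no trace of where one word ends and the next begins, which is the sense of "continuous" assumed in this sketch.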

Paper Structure

This paper contains 38 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: An illustration of all three adjustments that we make to convert text input to continuous streams of phonemes.
  • Figure 2: Mean (with Min and Max range) percentage difference achieved on each benchmark's macro score as a result of the three adjustments.
  • Figure 3: The overall BLiMP scores achieved by GPT-2 in our eight conditions with and without the UTT_BOUNDARY token (used to separate sentences) included at the end of evaluation instances.