Using Context to Improve Word Segmentation

Stephanie Hu, Xiaolu Guo

TL;DR

The paper investigates how context influences word segmentation in infant language acquisition. The authors implement two Bayesian nonparametric models, a unigram model and a bigram model, built on the Chinese restaurant process (CRP) and Dirichlet process (DP) formalisms. Gibbs sampling with annealing infers word boundaries from child-directed speech, and several experimental variations test robustness. The key finding is that the bigram model, which conditions each word on the preceding word, generally outperforms the unigram model at boundary prediction, and giving the model partial prior knowledge of the vocabulary further improves performance on the lexicon. These results support a role for context in early word segmentation and point to future work, such as trigram and bidirectional models, that could mirror human learning more closely and address the remaining segmentation challenges.

Abstract

An important step in understanding how children acquire language is studying how infants learn to segment words. Previous research has established that infants may use statistical regularities in speech to learn word segmentation. Goldwater et al. demonstrated that incorporating context into models improves their ability to learn word segmentation. We implemented two of their models, a unigram model and a bigram model, to examine how context can improve statistical word segmentation. The results are consistent with our hypothesis that the bigram model outperforms the unigram model at predicting word boundaries. Extending the work of Goldwater et al., we also explored basic ways to model how young children might use previously learned words to segment new utterances.
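
As a rough illustration of the statistical idea behind the unigram model, the sketch below computes the standard CRP/DP predictive probability of a candidate word given the words already posited. It is a minimal sketch, not the authors' implementation: the function names, the concentration value `alpha`, the phoneme inventory size, and the simple geometric-length base distribution are all assumptions chosen for illustration.

```python
from collections import Counter

def base_prob(word, n_phonemes=50, p_end=0.5):
    """Hypothetical base distribution P0 for a novel word.

    Assumes phonemes are drawn uniformly from an inventory of
    `n_phonemes` symbols and that word length is geometric with
    stopping probability `p_end`. Both values are illustrative.
    """
    length = len(word)
    return p_end * (1 - p_end) ** (length - 1) * (1.0 / n_phonemes) ** length

def crp_word_prob(word, counts, total, alpha=20.0):
    """Predictive probability of `word` under a CRP/DP unigram lexicon.

    counts: Counter of word tokens in the current segmentation
    total:  total number of word tokens in the current segmentation
    alpha:  concentration parameter (assumed value, not from the paper)
    """
    # Frequent words are reinforced by their counts; unseen words
    # fall back on the base distribution, scaled by alpha.
    return (counts[word] + alpha * base_prob(word)) / (total + alpha)

# Toy lexicon of already-segmented word tokens (made-up data).
counts = Counter(["yu", "want", "yu", "tu", "si", "D6", "bUk"])
total = sum(counts.values())

p_seen = crp_word_prob("yu", counts, total)      # word seen twice
p_new = crp_word_prob("yuwant", counts, total)   # unseen, longer string
```

A Gibbs sampler for segmentation would compare quantities of this kind for the "boundary" and "no boundary" hypotheses at each position; the bigram model would additionally condition on the preceding word.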

Paper Structure

This paper contains 34 sections, 7 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Sample iterations of the Gibbs sampling algorithm under the unigram model. Figure originally appeared in Goldwater et al., 2009.
  • Figure 2: Sample utterances and their phonetic representation.
  • Figure 3: First ten lines of the corpus segmented using our inference model.
  • Figure 4: Result of sampling using only the first ten lines of the corpus.
  • Figure 5: Comparison of the most frequent words discovered by our model to the top-occurring words in the first 100 lines of the actual corpus.
  • ...and 4 more figures