Using Context to Improve Word Segmentation
Stephanie Hu, Xiaolu Guo
TL;DR
The paper investigates how context influences word segmentation in infant language acquisition by implementing two Bayesian nonparametric models, a unigram model and a bigram model, built on the Chinese restaurant process (CRP) and Dirichlet process (DP) formalisms. Gibbs sampling with annealing is used to infer word boundaries from child-directed speech, and several experimental variations assess the robustness of the results. The key finding is that the bigram model, which conditions each word on the preceding word, generally outperforms the unigram model at boundary prediction, and that giving the model partial prior knowledge of the vocabulary further improves lexicon-level performance. These results support the role of context in early word segmentation and point to future work, such as trigram and bidirectional models, that could mirror human learning more closely and address the remaining segmentation errors.
Abstract
An important step in understanding how children acquire language is studying how infants learn to segment speech into words. Previous research has established that infants may use statistical regularities in speech to do so. Goldwater et al. demonstrated that incorporating context into a model improves its ability to learn word segmentation. We implemented two of their models, a unigram model and a bigram model, to examine how context can improve statistical word segmentation. The results are consistent with our hypothesis that the bigram model outperforms the unigram model at predicting word boundaries. Extending the work of Goldwater et al., we also explored basic ways to model how young children might use previously learned words to segment new utterances.
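
To make the CRP-based unigram model concrete, the following minimal sketch computes the standard CRP predictive probability of a word given the words already segmented: the word's current count plus a concentration parameter times a base probability, normalized by the total count plus the concentration parameter. The function names, the uniform phoneme distribution, and the values of `alpha` and `p_boundary` are illustrative assumptions, not the settings or code used in this paper.

```python
from collections import Counter

def base_prob(word, phoneme_probs, p_boundary=0.5):
    """Base distribution P0(w): a geometric prior on word length
    multiplied by the probability of each phoneme in the word.
    p_boundary is an assumed, illustrative value."""
    prob = p_boundary * (1.0 - p_boundary) ** (len(word) - 1)
    for phoneme in word:
        prob *= phoneme_probs[phoneme]
    return prob

def unigram_predictive(word, counts, total, alpha, phoneme_probs):
    """CRP predictive probability of `word` given previously seen words:
    (count(word) + alpha * P0(word)) / (total + alpha)."""
    return (counts[word] + alpha * base_prob(word, phoneme_probs)) / (total + alpha)

# Toy example: characters stand in for phonemes, with a uniform distribution.
alphabet = "abcdefghijklmnopqrstuvwxyz"
phoneme_probs = {ch: 1.0 / len(alphabet) for ch in alphabet}

counts = Counter(["the", "dog", "the"])  # words in the current segmentation
total = sum(counts.values())
alpha = 20.0                             # assumed concentration parameter

p_the = unigram_predictive("the", counts, total, alpha, phoneme_probs)
p_dog = unigram_predictive("dog", counts, total, alpha, phoneme_probs)
p_cat = unigram_predictive("cat", counts, total, alpha, phoneme_probs)
```

In this toy setting a frequently seen word ("the") receives more probability than a rarer one ("dog"), and an unseen word ("cat") still receives a small nonzero probability through the base distribution. A Gibbs sampler for segmentation would compare probabilities of this kind for the "boundary" and "no boundary" hypotheses at each candidate position; a bigram model would additionally condition on the preceding word.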
