Modelling Child Learning and Parsing of Long-range Syntactic Dependencies
Louis Mahon, Mark Johnson, Mark Steedman
TL;DR
This work presents a probabilistic, CCG-based model of child language acquisition that jointly learns word meanings and language-specific syntax from real child-directed speech paired with logical forms. By modelling parse structure and meaning as latent variables and employing a Dirichlet-process–driven EM algorithm, the approach can parse and infer meanings for unseen utterances, including those with long-range dependencies such as object wh-questions. The model attains strong results on word-order acquisition, lexical meaning and category learning, and full-utterance understanding, while demonstrating one-shot learning of nonce words and resilience to distractor meanings. Compared with prior work, it expands construction coverage, delivers fully correct parses for unseen data, and shows greater robustness and accuracy in meaning inference, providing insights into how children might learn syntax-semantics mappings from limited input.
Abstract
This work develops a probabilistic child language acquisition model to learn a range of linguistic phenomena, most notably long-range syntactic dependencies of the sort found in object wh-questions. The model is trained on a corpus of real child-directed speech, where each utterance is paired with a logical form as a meaning representation. From these pairs, it learns word meanings and language-specific syntax simultaneously. After training, the model can deduce the correct parse tree and word meanings for a given utterance-meaning pair, and can infer the meaning if given only the utterance. The successful modelling of long-range dependencies is theoretically important because it exploits aspects of the model that are, in general, trans-context-free.
