Modelling Child Learning and Parsing of Long-range Syntactic Dependencies

Louis Mahon, Mark Johnson, Mark Steedman

TL;DR

This work presents a probabilistic, CCG-based model of child language acquisition that jointly learns word meanings and language-specific syntax from real child-directed speech paired with logical forms. By modeling parse structure and meaning as latent variables and employing a Dirichlet-process–driven EM algorithm, the approach achieves parsing and meaning inference for unseen utterances, including long-range dependencies such as object wh-questions. The model attains strong results on word-order acquisition, lexical meaning and category learning, and robust full-utterance understanding, while demonstrating one-shot learning of nonce words and resilience to distractor meanings. Compared with prior work, it expands construction coverage, delivers fully correct parses for unseen data, and shows superior robustness and accuracy in meaning inference, providing insights into how children might learn syntax-semantics mappings from limited input.

Abstract

This work develops a probabilistic child language acquisition model to learn a range of linguistic phenomena, most notably long-range syntactic dependencies of the sort found in object wh-questions, among other constructions. The model is trained on a corpus of real child-directed speech, where each utterance is paired with a logical form as a meaning representation. It then learns both word meanings and language-specific syntax simultaneously. After training, the model can deduce the correct parse tree and word meanings for a given utterance-meaning pair, and can infer the meaning if given only the utterance. The successful modelling of long-range dependencies is theoretically important because it exploits aspects of the model that are, in general, trans-context-free.

Paper Structure

This paper contains 31 sections, 7 equations, 17 figures, 2 tables, 1 algorithm.

Figures (17)

  • Figure 1: Example of a CCG derivation for a simple transitive sentence from the Adam (English) corpus.
  • Figure 2: Example of the alternative, type-raise and compose, CCG derivation for the sentence in Figure 1 from the Adam (English) corpus.
  • Figure 3: Example of a CCG derivation of the object-wh question corresponding to Figure 1.
  • Figure 4: Graphical model for our probabilistic model. $T$ is the parse tree; $e_s$, $m_s$ and $w_s$ are the leaf-level shell logical form (LF), LF and word; $m$ is the root-level LF; $w$ is the utterance; and $\theta_x$ is the subset of the full set $\theta$ of model parameters, consisting of the co-occurrence counts in the distribution $x$, as described in the section on the probabilistic model. Green indicates that a variable is observed, and red indicates that it is unobserved. These colours are for train time; at test time, $m$ would also be red. The fact that $w$ is observed but $w_s$ is not reflects the fact that the model sees the full utterance, but not where the word boundaries should be, and similarly with $m$ and $m_s$.
  • Figure 5: One of the parses considered by the learner for this example. Given information is in green, inferred information is in pink. As this is train time, the model sees both the utterance and the root LF. Strictly speaking, the model only sees the full utterance and does not see individual words, because the boundaries between words are not given. This is reflected in Figure 4.
  • ...and 12 more figures
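
The two derivation styles named in the captions of Figures 1 and 2 (plain application versus type-raise and compose) can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' implementation: the class names `Atom` and `Func`, the helper functions, and the three-word example (subject NP, transitive verb, object NP) are all assumptions made here, and the actual sentence and categories shown in the paper's figures are not reproduced in this summary.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Atom:
    """An atomic category such as S or NP (hypothetical helper class)."""
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Func:
    """A functor category: result/arg seeks arg to the right, result\\arg to the left."""
    result: "Cat"
    slash: str  # "/" or "\\"
    arg: "Cat"

    def __str__(self) -> str:
        return f"({self.result}{self.slash}{self.arg})"


Cat = Union[Atom, Func]


def forward_apply(left: Cat, right: Cat) -> Optional[Cat]:
    """Forward application: X/Y  Y  =>  X."""
    if isinstance(left, Func) and left.slash == "/" and left.arg == right:
        return left.result
    return None


def backward_apply(left: Cat, right: Cat) -> Optional[Cat]:
    """Backward application: Y  X\\Y  =>  X."""
    if isinstance(right, Func) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


def type_raise(cat: Cat, target: Cat) -> Cat:
    """Forward type-raising: X  =>  T/(T\\X)."""
    return Func(target, "/", Func(target, "\\", cat))


def forward_compose(left: Cat, right: Cat) -> Optional[Cat]:
    """Forward composition: X/Y  Y/Z  =>  X/Z."""
    if (isinstance(left, Func) and left.slash == "/"
            and isinstance(right, Func) and right.slash == "/"
            and left.arg == right.result):
        return Func(left.result, "/", right.arg)
    return None


S, NP = Atom("S"), Atom("NP")
subj, obj = NP, NP
verb = Func(Func(S, "\\", NP), "/", NP)  # transitive verb: (S\NP)/NP

# Application-only derivation: the verb takes its object, then its subject.
standard = backward_apply(subj, forward_apply(verb, obj))

# Type-raise and compose: the subject and verb combine first into S/NP.
subj_verb = forward_compose(type_raise(subj, S), verb)
alternative = forward_apply(subj_verb, obj)

print(standard, subj_verb, alternative)
```

Both routes end in the same root category S. The intermediate `subj_verb` constituent has category S/NP, a clause still missing its object. In standard CCG analyses this is the kind of constituent a fronted wh-word combines with in an object wh-question; the exact categories the paper uses for Figure 3 are not given here.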