A Language-agnostic Model of Child Language Acquisition

Louis Mahon, Omri Abend, Uri Berger, Katherine Demuth, Mark Johnson, Mark Steedman

TL;DR

This work investigates whether a language-agnostic semantic bootstrapping model for child language acquisition can transfer from English to Hebrew. It reimplements the CCG-based framework of Abend et al. (2017), training on real CHILDES utterances paired with logical forms and using an EM-style algorithm with Dirichlet-process conditionals to jointly learn syntax and word meanings. On both English (Adam) and Hebrew (Hagar), the model achieves high word-meaning accuracy and learns a dominant SVO order, but on Hebrew the learning of word order and syntactic categories is slower and less robust, in part because of Hebrew's richer morphology. The findings highlight the value of cross-language evaluation for child language acquisition (CLA) models and point to morphology-aware extensions as a way to improve multilingual acquisition.

Abstract

This work reimplements a recent semantic bootstrapping child-language acquisition model, which was originally designed for English, and trains it to learn a new language: Hebrew. The model learns from pairs of utterances and logical forms as meaning representations, and acquires both syntax and word meanings simultaneously. The results show that the model mostly transfers to Hebrew, but that a number of factors, including the richer morphology in Hebrew, make the learning slower and less robust. This suggests that a clear direction for future work is to enable the model to leverage the similarities between different word forms.
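
The TL;DR mentions Dirichlet-process conditionals updated by an EM-style algorithm. As a rough illustration only, the sketch below uses the standard Dirichlet-process predictive probability, (count + alpha * base) / (total + alpha), applied to the six possible word orders. The class name `DPConditional`, the uniform base distribution, the value of `alpha`, and the use of fractional weights as expected counts are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


class DPConditional:
    """Hypothetical helper: a Dirichlet-process predictive distribution.

    prob(x) = (count(x) + alpha * base_prob(x)) / (total + alpha)
    """

    def __init__(self, alpha, base_prob):
        self.alpha = alpha          # concentration parameter (assumed value)
        self.base_prob = base_prob  # base distribution over outcomes
        self.counts = Counter()     # possibly fractional (expected) counts
        self.total = 0.0

    def prob(self, outcome):
        # Counter returns 0 for unseen outcomes without inserting them.
        numerator = self.counts[outcome] + self.alpha * self.base_prob(outcome)
        return numerator / (self.total + self.alpha)

    def observe(self, outcome, weight=1.0):
        # In an EM-style update, weight could be the posterior probability
        # of the parse that used this outcome (an assumption of this sketch).
        self.counts[outcome] += weight
        self.total += weight


WORD_ORDERS = ["SVO", "SOV", "VSO", "VOS", "OSV", "OVS"]


def uniform_base(order):
    # Uniform prior over the six word orders (assumed for illustration).
    return 1.0 / len(WORD_ORDERS)


word_order = DPConditional(alpha=1.0, base_prob=uniform_base)
prior_svo = word_order.prob("SVO")      # 1/6 before any data
for _ in range(3):
    word_order.observe("SVO")
posterior_svo = word_order.prob("SVO")  # (3 + 1/6) / (3 + 1) = 19/24
```

With a small `alpha`, a handful of informative observations is enough to move most of the probability mass onto one order. This is consistent with, though not a reproduction of, the rapid preference for SVO described in the figure captions.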

Paper Structure (24 sections, 9 equations, 12 figures, 8 tables)

This paper contains 24 sections, 9 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Example of a CCG derivation for a simple sentence from the Adam (English) corpus.
  • Figure 2: Example of a CCG derivation for a simple sentence from the Hagar (Hebrew) corpus. The sentence translates to English as "he's cutting wood", literally "he cut-pres-p wood".
  • Figure 3: One of the parses considered by the learner for this example. Lambdas are written $L$ and variables are numbers $0, \dots,$.
  • Figure 4: Evolution, over the course of training, of the learner's preference for each of the six possible word orders on Adam (English). It learns rapidly and confidently to favour SVO.
  • Figure 5: Evolution, over the course of training, of the learner's preference for each of the six possible word orders on Hagar (Hebrew). It learns SVO confidently, but more gradually than on Adam (English), and there are visible jumps where the learner encountered data points that were key for syntax learning.
  • ...and 7 more figures
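
Since Figures 1 and 2 are CCG derivations, a minimal sketch of the two basic CCG combinators (forward and backward application) may help readers unfamiliar with the formalism. The tuple encoding of categories, the function names, and the category assigned to the verb are illustrative assumptions; they are not copied from the paper's figures or its implementation.

```python
# Atomic categories are strings; complex categories are (result, slash, argument).
NP, S = "NP", "S"


def forward_apply(left, right):
    # Forward application:  X/Y  Y  =>  X
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None


def backward_apply(left, right):
    # Backward application:  Y  X\Y  =>  X
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


# Assumed SVO transitive-verb category (S\NP)/NP: it takes the object to
# its right first, then the subject to its left.
TRANSITIVE_SVO = ((S, "\\", NP), "/", NP)

# Toy lexicon for the gloss "he cut wood" (categories are assumptions).
lexicon = {"he": NP, "cut": TRANSITIVE_SVO, "wood": NP}

verb_phrase = forward_apply(lexicon["cut"], lexicon["wood"])  # S\NP
sentence = backward_apply(lexicon["he"], verb_phrase)         # S
```

Under this encoding, a different word order corresponds to a different slash pattern on the verb category. For example, `((S, "\\", NP), "\\", NP)` would take both arguments from the left, as in a verb-final order.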