Investigating Critical Period Effects in Language Acquisition through Neural Language Models

Ionut Constantinescu, Tiago Pimentel, Ryan Cotterell, Alex Warstadt

TL;DR

The paper uses bilingual language models to probe critical period (CP) effects in language learning by varying the age of exposure to $L_2$ and testing $L_1$ attrition, and finds that standard LMs do not exhibit human-like CP effects. It shows that introducing a plasticity constraint via Elastic Weight Consolidation (EWC) can produce CP-like patterns, providing a computational lens on innate versus experiential theories of the CP. The work argues that CP effects are not an inevitable outcome of statistical learning but may require innate maturational mechanisms or engineered constraints, and it offers a framework for making language models more developmentally plausible. These insights help delineate the boundary between human-specific language acquisition mechanisms and general-purpose learning dynamics, with implications for cognitive modeling and for reverse-engineering CP phenomena.

Abstract

Humans appear to have a critical period (CP) for language acquisition: Second language (L2) acquisition becomes harder after early childhood, and ceasing exposure to a first language (L1) after this period (but not before) typically does not lead to substantial loss of L1 proficiency. It is unknown whether these CP effects result from innately determined brain maturation or from a stabilization of neural connections naturally induced by experience. In this study, we use language models (LMs) to test the extent to which these phenomena are peculiar to humans, or shared by a broader class of language learners. We vary the age of exposure by training LMs on language pairs in various experimental conditions, and find that LMs, which lack any direct analog to innate maturational stages, do not show CP effects when the age of exposure of L2 is delayed. Our results contradict the claim that CP effects are an inevitable result of statistical learning, and they are consistent with an innate mechanism for CP effects. We show that we can reverse-engineer the CP by introducing a regularizer partway through training to simulate a maturational decrease in plasticity. All in all, our results suggest that L1 learning on its own may not be enough to induce a CP, and additional engineering is necessary to make language models more cognitively plausible.
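
The regularizer mentioned above is identified in the TL;DR as Elastic Weight Consolidation (EWC). As a rough illustration only, the sketch below computes the standard EWC quadratic penalty, $\frac{\lambda}{2}\sum_i F_i(\theta_i - \theta^*_i)^2$, and adds it to an L2 training loss. This page does not give the paper's exact formulation, so the function names, the flat-list parameter layout, and the diagonal Fisher estimate `fisher_diag` are all assumptions made for illustration.

```python
from typing import Sequence


def ewc_penalty(
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher_diag: Sequence[float],
    lam: float,
) -> float:
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params        -- current parameters (hypothetical flat list)
    anchor_params -- parameters saved at the end of L1 training (assumed)
    fisher_diag   -- diagonal Fisher information estimated on L1 (assumed)
    lam           -- EWC strength; larger values mean lower plasticity
    """
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all parameter sequences must have the same length")
    weighted_sq_dist = sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )
    return 0.5 * lam * weighted_sq_dist


def regularized_loss(
    l2_loss: float,
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher_diag: Sequence[float],
    lam: float,
) -> float:
    """L2 training loss plus the EWC penalty that anchors weights to L1 values."""
    return l2_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)
```

With `lam = 0` the penalty vanishes and training reduces to ordinary L2 learning. Increasing `lam` makes it costlier to move parameters that the Fisher estimate marks as important for L1, which is one way to read the trade-off that Figure 5 plots against $\lambda$.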

Paper Structure (56 sections, 10 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: A visualization of the training conditions, using $\mathrm{L}_1=\mathtt{de}$, $\mathrm{L}_2=\mathtt{en}$, $\mathrm{S}=600$M, $\mathrm{E}=6$.
  • Figure 2: $\mathrm{L}_2$ ($\mathtt{en}$) results for regular training (6 epochs). Results are aggregated across model configuration and $\mathrm{L}_1$ ($\mathtt{de}$ and $\mathtt{fi}$). Top: PPL per character on $\mathrm{L}_2$ ($\mathtt{en}$) during training on $\mathrm{L}_2$. Middle: Accuracy on BLiMP during training on $\mathrm{L}_2$. Bottom: Performance on GLUE at the end of training.
  • Figure 3: $\mathrm{L}_1$ ($\mathtt{en}$) results when the language order is reversed (6 + 6 epochs). Results are aggregated across $\mathrm{L}_2$ ($\mathtt{de}$ and $\mathtt{fi}$). Top: PPL per character on the $\mathrm{L}_1$ ($\mathtt{en}$) validation set during training. Middle: Accuracy on BLiMP during training. Bottom: Performance on GLUE at the end of training on $\mathrm{L}_1$ and $\mathrm{L}_2$.
  • Figure 4: Summary of the $\mathrm{L}_2$ ($\mathtt{en}$) evaluation results for the convergence training (48 epochs). Results are aggregated across $\mathrm{L}_1$ ($\mathtt{de}$ and $\mathtt{fi}$). Top left: PPL per character on the $\mathrm{L}_2$ ($\mathtt{en}$) validation set during training on $\mathrm{L}_2$. Top right: PPL per character on the $\mathrm{L}_1$ ($\mathtt{de}$, $\mathtt{fi}$) validation set during training. Bottom left: Accuracy on BLiMP during $\mathrm{L}_2$ training. Bottom right: Performance on GLUE at the end of $\mathrm{L}_2$ training.
  • Figure 5: Trade-off between $\mathrm{L}_1$ and $\mathrm{L}_2$ performance (CE) at the end of training as a function of $\lambda$ (EWC strength). Results are aggregated across $\mathrm{L}_1$ ($\mathtt{de}$ and $\mathtt{fi}$).
  • ...and 1 more figure