Robustness of the Random Language Model

Fatemeh Lalegani; Eric De Giuli

TL;DR

The paper investigates the robustness of the Random Language Model (RLM), an ensemble of stochastic context-free grammars, to explicit symmetry breaking and surface biases, and examines whether its proposed transition to grammatical syntax remains intact under realistic extensions. By analyzing surface heterogeneity, Zipfian biases, and finite-size effects, the authors show the transition persists and can be characterized by an effective surface temperature $\tilde{\epsilon}_s^{eff}$, with a sharp thermodynamic transition expected in the $N\to\infty$ limit. Comparisons with human syntactic networks (e.g., clustering in sentence graphs around 24 months) reveal qualitative alignment between the RLM transition and early language development, supporting the idea that a single continuous learning transition underlies syntax emergence. The work also discusses implications for linguistic theory (e.g., continuous vs discrete learning) and for connections to modern ML approaches, while outlining avenues for analytic solutions in idealized limits.

Abstract

The Random Language Model (De Giuli 2019) is an ensemble of stochastic context-free grammars, quantifying the syntax of human and computer languages. The model suggests a simple picture of first-language learning as a type of annealing in the vast space of potential languages. In its simplest formulation, it implies a single continuous transition to grammatical syntax, at which the symmetry among potential words and categories is spontaneously broken. Here this picture is scrutinized by considering its robustness against extensions of the original model, and against trajectories through parameter space different from those originally considered. It is shown that (i) the scenario is robust to explicit symmetry breaking, an inevitable component of learning in the real world; and (ii) the transition to grammatical syntax can be encountered by fixing the deep (hidden) structure while varying the surface (observable) properties. It is also argued that the transition becomes a sharp thermodynamic transition in an idealized limit. Moreover, comparison with human data on the clustering coefficient of syntax networks suggests that the observed transition is equivalent to that normally experienced by children at around 24 months of age. The results are discussed in light of the theory of first-language acquisition in linguistics and of recent successes in machine learning.
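
To make the phrase "ensemble of stochastic context-free grammars" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a grammar with $N$ hidden symbols and $T$ surface symbols, branching rules $a \to bc$ and emission rules $a \to B$, and lognormally distributed rule weights. The spread parameters `sigma_d` and `sigma_s`, the emission probability `p_emit`, and all function names are hypothetical; the mapping from these spreads to the temperatures $\tilde{\epsilon}_d$ and $\tilde{\epsilon}_s$ is not specified here and is left as an assumption (a larger spread plays the role of a lower temperature).

```python
import math
import random
from collections import Counter


def random_grammar(N, T, sigma_d, sigma_s, rng):
    """Draw lognormal rule weights and normalize them into probabilities.

    sigma_d, sigma_s are hypothetical spread parameters (assumption):
    zero spread gives uniform rules, large spread gives heterogeneous rules.
    """
    M = []  # M[a][b * N + c]: probability of the branching rule a -> b c
    for _ in range(N):
        w = [math.exp(sigma_d * rng.gauss(0.0, 1.0)) for _ in range(N * N)]
        total = sum(w)
        M.append([x / total for x in w])
    O = []  # O[a][B]: probability of the emission rule a -> B
    for _ in range(N):
        w = [math.exp(sigma_s * rng.gauss(0.0, 1.0)) for _ in range(T)]
        total = sum(w)
        O.append([x / total for x in w])
    return M, O


def sample_sentence(M, O, p_emit, rng, max_len=200):
    """Expand hidden symbol 0 depth-first, left to right.

    Each hidden symbol emits a surface symbol with probability p_emit,
    otherwise it branches. Output is truncated at max_len tokens.
    """
    N, T = len(M), len(O[0])
    stack = [0]
    out = []
    while stack and len(out) < max_len:
        a = stack.pop()
        if rng.random() < p_emit:
            out.append(rng.choices(range(T), weights=O[a])[0])
        else:
            bc = rng.choices(range(N * N), weights=M[a])[0]
            b, c = divmod(bc, N)
            stack.append(c)
            stack.append(b)  # pushed last so the left child is expanded first
    return out


def surface_entropy(sentences):
    """Shannon entropy (in nats) of surface-symbol frequencies."""
    counts = Counter(tok for s in sentences for tok in s)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


rng = random.Random(0)
M, O = random_grammar(N=5, T=20, sigma_d=2.0, sigma_s=0.1, rng=rng)
sentences = [sample_sentence(M, O, 0.6, rng) for _ in range(2000)]
H = surface_entropy(sentences)  # bounded above by log T
```

Choosing `p_emit` above one half keeps the expected tree size finite, which is why the sketch uses 0.6. Sweeping `sigma_s` at fixed `sigma_d` would mimic, under these assumptions, a path that varies surface properties while holding the deep structure fixed.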

Paper Structure (8 sections, 33 equations, 9 figures)

Brief review of the Random Language Model
The RLM transition is encountered by increasing surface heterogeneity
Learning a context-free grammar
RLM with a bias
Comparison with human data
Finite-size scaling
Discussion
Conclusion

Figures (9)

Figure 1: Illustrative derivation trees for (a) a simple English sentence, and (b) RNA secondary structure (after Searls02). The latter is a derivation of the sequence 'gacuaagcugaguc' and shows its folded structure. Terminal symbols are encircled. Figure reproduced from DeGiuli19.
Figure 2: Phase diagram of the RLM, in the replica-symmetric approximation. Text is grammatical in the lower-left region, demarcated approximately by $\tilde{\epsilon}_s \log T\approx 1, \tilde{\epsilon}_d \log N\approx 1$ (light dotted). Three paths $\gamma_j$ through the diagram are sketched: $\gamma_1$ at fixed $\tilde{\epsilon}_s$, considered in DeGiuli19; $\gamma_2$ with $\tilde{\epsilon}_s=\tilde{\epsilon}_d$, discussed below; and $\gamma_3$ at fixed $\tilde{\epsilon}_d$, also discussed below.
Figure 3: The RLM transition can be encountered by lowering the surface temperature $\epsilon_s$. Curves are shown at $T=1000$, $\tilde{\epsilon}_d \approx 0.03$, and indicated values of $N$; (a) the surface entropy drops around $\tilde{\epsilon}_s\approx 1/\log T$, while (b) the surface order parameter $P_2$ increases as $\tilde{\epsilon}_s$ is lowered.
Figure 4: The RLM transition is robust to the addition of a Zipfian surface bias. Curves are shown at $T=100$, $\tilde{\epsilon}_d \approx 0.03$, and indicated values of bias strength $h$; (a) the surface entropy versus $\tilde{\epsilon}_s$; bias increases from left to right; (b) the surface entropy versus an effective $\tilde{\epsilon}_s^{eff}(\tilde{\epsilon}_s,h)$ (see text). The onset of nontrivial surface entropy occurs at approximately $\tilde{\epsilon}_s^{eff}\approx 1$, but its development is weaker at larger biases. In (c) the same data from (b) is shown as an approach to the trivial value $H_s \to \log T$, valid as $\tilde{\epsilon}_s \to \infty$. All curves intersect approximately at $\tilde{\epsilon}_s\approx 1$.
Figure 5: Example syntax forest (a), dependency graph (b), and directed sentence graph (c) obtained from human data. Note that the word 'fix' appeared in the dependency graph of Corominas-Murtra09 but not in the syntax tree shown therein.
...and 4 more figures
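
The comparison with human data rests on the clustering coefficient of sentence graphs. As a rough illustration only, the sketch below builds an undirected word graph and computes the standard average local clustering coefficient, $C_i = 2 e_i / (k_i (k_i - 1))$, where $k_i$ is the degree of node $i$ and $e_i$ the number of links among its neighbours. Linking consecutive words is a simplifying assumption made here; the networks of Corominas-Murtra09 are built from syntactic relations, and the function names are hypothetical.

```python
from itertools import combinations


def sentence_graph(sentences):
    """Undirected adjacency sets; edges join consecutive words (an assumption)."""
    adj = {}
    for words in sentences:
        for w in words:
            adj.setdefault(w, set())
        for u, v in zip(words, words[1:]):
            if u != v:  # ignore self-loops from repeated words
                adj[u].add(v)
                adj[v].add(u)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count links among the neighbours of this node.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# Toy corpus, invented purely for illustration.
corpus = [
    ["the", "dog", "runs"],
    ["the", "cat", "runs"],
    ["dog", "the", "runs"],
]
C = average_clustering(sentence_graph(corpus))
```

Applied to transcripts grouped by age, a quantity of this kind could be tracked over time, which is the sort of curve the comparison with the RLM transition relies on.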