Linguistic Structure Induction from Language Models
Omar Momen
TL;DR
The thesis addresses unsupervised induction of syntactic structure (constituency and dependency) from language models, focusing on StructFormer, a transformer augmented with a parser network that outputs Syntactic Distances $D$ and Syntactic Heights $H$ and uses a differentiable dependency function to guide attention. It reproduces StructFormer on PTB, investigates architectural variants such as in-between parser placement, and extends the model to subword tokenization and BabyLM pretraining to assess data efficiency and cross-domain generalization. Across masked language modeling (MLM), constituency parsing (unlabeled F1, UF1), and dependency parsing (unlabeled attachment score, UAS), results show that StructFormer can induce meaningful trees and can match or improve on baselines in certain settings; mid-layer parser placement often yields the best language modeling gains but occasionally makes tree induction unstable. The work highlights both the promise and the limitations of retrofitting transformer models for syntactic structure induction, and stresses the need for standardized benchmarks, diversified data, and broader linguistic evaluation. It also demonstrates that structure induction over subword tokens is feasible and that BabyLM-scale experiments provide a data-efficient platform for evaluating cognitively plausible models. Overall, the findings support continued exploration of syntactic inductive biases in LMs, while pointing to open problems in stability, multilingual extension, and semantics-aware evaluation.
Abstract
Linear sequences of words are implicitly represented in our brains by hierarchical structures that organize the composition of words in sentences. Linguists have formalized different frameworks to model this hierarchy; two of the most common syntactic frameworks are Constituency and Dependency. Constituency represents a sentence as nested groups of phrases, while dependency represents a sentence by assigning relations between its words. Recently, the pursuit of intelligent machines has produced Language Models (LMs) capable of solving many language tasks with human-level performance. Many studies now question whether LMs implicitly represent syntactic hierarchies. This thesis focuses on producing constituency and dependency structures from LMs in an unsupervised setting. I review the critical methods in this field and highlight a line of work that uses a numerical representation of binary constituency trees (Syntactic Distance). I present a detailed study of StructFormer (SF) (Shen et al., 2021), which retrofits a transformer encoder architecture with a parser network to produce constituency and dependency structures. I present six experiments that analyze and address this field's challenges; they include investigating the effect of repositioning the parser network within the SF architecture, evaluating trees induced over subwords, and benchmarking the models developed in the thesis experiments on linguistic tasks. Model benchmarking is performed by participating in the BabyLM challenge, with the submission published at CoNLL 2023 (Momen et al., 2023). The results of this thesis encourage further development in the direction of retrofitting transformer-based models to induce syntactic structures: SF performs acceptably across different experimental settings, and the limitations observed call for innovative solutions to advance the state of syntactic structure induction.
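To make the idea of Syntactic Distance concrete, the sketch below shows one common way a sequence of distances can be decoded into a binary constituency tree: split the sentence at the largest distance and recurse on both halves. This is a minimal illustration, not the StructFormer implementation; the function name `distances_to_tree`, the tuple-based tree encoding, and the example distance values are assumptions made here for illustration (the values are hand-picked, not model output).

```python
from typing import List, Tuple, Union

# A tree is either a single word or a pair (left subtree, right subtree).
Tree = Union[str, Tuple["Tree", "Tree"]]


def distances_to_tree(words: List[str], distances: List[float]) -> Tree:
    """Decode a binary tree from syntactic distances (illustrative sketch).

    distances[i] is the distance between words[i] and words[i + 1], so a
    sentence of n words has n - 1 distances. The largest distance marks the
    highest split point; each side is then decoded recursively.
    """
    if len(words) != len(distances) + 1:
        raise ValueError("expected exactly len(words) - 1 distances")
    if len(words) == 1:
        return words[0]

    # Index of the largest distance; ties resolve to the leftmost position.
    split = max(range(len(distances)), key=distances.__getitem__)

    left = distances_to_tree(words[: split + 1], distances[:split])
    right = distances_to_tree(words[split + 1 :], distances[split + 1 :])
    return (left, right)


# Hypothetical distances for a four-word sentence.
example_tree = distances_to_tree(
    ["the", "cat", "sat", "down"], [1.0, 3.0, 2.0]
)
print(example_tree)
```

Here the largest distance lies between "cat" and "sat", so the sentence is first split into "the cat" and "sat down", and each half is then split again. In SF the distances (and heights) are used in a differentiable way during training; a discrete decoding of this kind is only needed when explicit trees are extracted for evaluation.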
