Linguistic Structure Induction from Language Models

Omar Momen

TL;DR

The thesis addresses unsupervised induction of syntactic structure (constituency and dependency) from language models, focusing on StructFormer, a transformer augmented with a parser network that outputs Syntactic Distances $D$ and Syntactic Heights $H$ and uses a differentiable dependency function to guide attention. It reproduces StructFormer on PTB, investigates architectural variants such as in-between parser placement, and extends to subword tokenization and BabyLM pretraining to assess data efficiency and cross-domain generalization. Across MLM, constituency parsing (UF1), and dependency parsing (UAS), results show StructFormer can induce meaningful trees and can improve or match baselines in certain settings, with mid-layer parser placement often yielding the best language modeling gains but occasionally introducing instability in tree induction. The work highlights the promise and limitations of retrofitting transformer models for syntactic structure induction, stressing the need for standardized benchmarks, diversified data, and broader linguistic evaluation to advance the field. It also demonstrates that subword tokenization is feasible for structure induction and that BabyLM-scale experiments provide a data-efficient platform for evaluating cognitively plausible models. Overall, the findings support continued exploration of syntactic inductive biases in LMs, while signaling avenues for better stability, multilingual extension, and semantics-aware evaluation.

Abstract

Linear sequences of words are implicitly represented in our brains by hierarchical structures that organize the composition of words in sentences. Linguists have formalized different frameworks to model this hierarchy; two of the most common syntactic frameworks are Constituency and Dependency. Constituency represents a sentence as nested groups of phrases, while dependency represents a sentence by assigning relations between its words. Recently, the pursuit of intelligent machines has produced Language Models (LMs) capable of solving many language tasks with human-level performance. Many studies now question whether LMs implicitly represent syntactic hierarchies. This thesis focuses on producing constituency and dependency structures from LMs in an unsupervised setting. I review the critical methods in this field and highlight a line of work that uses a numerical representation for binary constituency trees (Syntactic Distance). I present a detailed study of StructFormer (SF) (Shen et al., 2021), which retrofits a transformer encoder architecture with a parser network to produce constituency and dependency structures. I present six experiments to analyze and address this field's challenges; these include investigating the effect of repositioning the parser network within the SF architecture, evaluating subword-based induced trees, and benchmarking the models developed in the thesis experiments on linguistic tasks. Model benchmarking is performed through participation in the BabyLM challenge, published at CoNLL 2023 (Momen et al., 2023). The results of this thesis encourage further development in the direction of retrofitting transformer-based models to induce syntactic structures. This is supported both by the acceptable performance of SF across different experimental settings and by the observed limitations, which require innovative solutions to advance the state of syntactic structure induction.
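
To make the Syntactic Distance representation mentioned above concrete, the following is a minimal sketch of how a sequence of distances between adjacent words can be decoded into a binary constituency tree by recursively splitting at the largest distance. It illustrates the general idea only and is not necessarily the exact decoding procedure used in the thesis or in StructFormer; the function name `distances_to_tree` and the distance values in the usage example are hypothetical.

```python
from typing import Sequence, Tuple, Union

# A tree is either a single word or a pair (left subtree, right subtree).
Tree = Union[str, Tuple["Tree", "Tree"]]


def distances_to_tree(words: Sequence[str], distances: Sequence[float]) -> Tree:
    """Greedy top-down decoding of a binary tree from syntactic distances.

    Assumption: distances[i] is the distance between words[i] and words[i + 1],
    so there is exactly one fewer distance than there are words.
    """
    if len(words) != len(distances) + 1:
        raise ValueError("expected len(distances) == len(words) - 1")
    if len(words) == 1:
        return words[0]
    # Split at the largest distance; ties are resolved to the leftmost position.
    split = max(range(len(distances)), key=lambda i: distances[i])
    left = distances_to_tree(words[: split + 1], distances[:split])
    right = distances_to_tree(words[split + 1 :], distances[split + 1 :])
    return (left, right)


# Hypothetical distances chosen by hand for illustration, not model output.
example_words = ["The", "cat", "sat", "on", "the", "mat"]
example_distances = [1.0, 3.0, 2.0, 1.5, 0.5]
example_tree = distances_to_tree(example_words, example_distances)
print(example_tree)
```

With these hand-picked values, the largest distance lies between "cat" and "sat", so the first split separates the subject from the rest of the sentence; smaller distances are then resolved inside each half.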

Paper Structure (42 sections, 27 equations, 33 figures, 8 tables, 2 algorithms)

Foundations
Introduction
Thesis Structure
Preliminaries
Linguistics
Deep Learning
Deep Learning and Syntax
Related Work
Problem Definition
Overview of Approaches
Syntax Distance Approach
Survey Works
Observations on Related Work
StructFormer: An Illustrative Study
Model Architecture
...and 27 more sections

Figures (33)

Figure 1: Two constituency trees corresponding to Interpretation #1 (left) and Interpretation #2 (right). Constituency Parsing is one of the most common structural syntactic frameworks. These annotations are inspired by popular examples in the field, as in Charniak (1997).
Figure 2: An Example of a Constituency Tree for the sentence: The cat sat on the mat.
Figure 3: An Example of a Dependency Parse in CoNLL format for the sentence: The cat sat on the mat. ID is the word index in the sentence, Word is the actual word in the sentence, Lemma is the base or dictionary form of the word, POS is the part-of-speech tag of the word, Head is the ID of the head word for each word, and DepRel is the dependency relation label. The root of the sentence (the main verb) points to "0" because it doesn't have a head.
Figure 4: An Example of a Dependency Tree for the sentence: The cat sat on the mat.
Figure 5: The Transformer model architecture. The figure is copied from Vaswani et al. (2017).
...and 28 more figures
