Understanding Syntactic Generalization in Structure-inducing Language Models

David Arps, Hassan Sajjad, Laura Kallmeyer

TL;DR

This study systematically compares three structure-inducing language models (GPST, StructFormer, UDGN) across English, German, Chinese, and formal Dyck languages to probe how unsupervised hierarchical representations emerge during self-supervised training. By introducing new formal-language benchmarks (Dyck-k and Dyck-u) and minimal-pair tests, the authors reveal substantial variability in induced syntactic representations across training runs and data, with GPST offering the most robust and linguistically coherent generalization, especially for long-distance dependencies. The findings highlight a mixed landscape: the best-performing architecture depends on the evaluation dimension, and none achieves universal dominance, underscoring the need for stability-focused design and broader multilingual evaluation. The work also provides benchmark data and an evaluation pipeline to guide future development of structure-inducing models and their scalability.

Abstract

Structure-inducing Language Models (SiLMs) are trained on a self-supervised language modeling task and induce a hierarchical sentence representation as a byproduct when processing an input. SiLMs couple strong syntactic generalization behavior with competitive performance on various NLP tasks, but many of their basic properties remain underexplored. In this work, we train three different SiLM architectures from scratch: StructFormer (Shen et al., 2021), UDGN (Shen et al., 2022), and GPST (Hu et al., 2024b). We train these architectures on both natural language corpora (English, German, and Chinese) and synthetic bracketing expressions. The models are then evaluated with respect to (i) properties of the induced syntactic representations, (ii) performance on grammaticality judgment tasks, and (iii) training dynamics. We find that none of the three architectures dominates across all evaluation metrics. However, there are significant differences, in particular with respect to the induced syntactic representations. The Generative Pretrained Structured Transformer (GPST; Hu et al., 2024b) performs most consistently across evaluation settings and outperforms the other models on long-distance dependencies in bracketing expressions. Furthermore, our study shows that small models trained on large amounts of synthetic data provide a useful testbed for evaluating basic model properties.
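To make the "synthetic bracketing expressions" concrete, here is a minimal Python sketch of a well-formedness check for a Dyck language with k bracket types, plus a helper that builds an ill-formed counterpart for a minimal pair. This is an illustration only, not the authors' data generator or evaluation pipeline: it assumes the textbook definition of Dyck-k (balanced, properly nested brackets over k types), and the token encoding and the names `is_dyck` and `corrupt_last_close` are hypothetical. The paper's exact Dyck-$u$ definition is not given in this summary and is not modeled here.

```python
from typing import List, Tuple

# Hypothetical encoding: a token is (bracket_type, is_open).
Token = Tuple[int, bool]


def is_dyck(tokens: List[Token], k: int) -> bool:
    """Return True if tokens form a balanced, properly nested string over k bracket types."""
    stack: List[int] = []
    for bracket_type, is_open in tokens:
        if not 0 <= bracket_type < k:
            return False  # bracket type outside the vocabulary
        if is_open:
            stack.append(bracket_type)
        elif not stack or stack.pop() != bracket_type:
            return False  # closing bracket with no matching opener
    return not stack  # every opener must have been closed


def corrupt_last_close(tokens: List[Token], k: int) -> List[Token]:
    """Build the ill-formed half of a minimal pair by changing the type of the final closing bracket."""
    if k < 2 or not tokens or tokens[-1][1]:
        raise ValueError("need k >= 2 and a string that ends in a closing bracket")
    bracket_type, _ = tokens[-1]
    return tokens[:-1] + [((bracket_type + 1) % k, False)]


# Example with k = 2, reading type 0 as "( )" and type 1 as "[ ]": the string "( [ ] )".
well_formed = [(0, True), (1, True), (1, False), (0, False)]
ill_formed = corrupt_last_close(well_formed, k=2)  # "( [ ] ]"
```

In a grammaticality judgment setting, a model would be expected to prefer `well_formed` over `ill_formed`; the number of tokens between the outer opening bracket and its closing partner is what makes such a pair a test of long-distance dependencies.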

Paper Structure

This paper contains 50 sections, 7 figures, 15 tables.

Figures (7)

  • Figure 1: Examples from the Dyck-$64$ language (top), and the Dyck-$u$ language (bottom).
  • Figure 2: $t_x$-consistency for natural languages (top) and Dyck languages (bottom), measured in UAS for $\mathrm{SF}$ and $\mathrm{UDGN}$, and F score for $\mathrm{GPST}$.
  • Figure 3: $t_x$-evolution for English.
  • Figure 4: Performance on minimal pairs for Dyck-$u$, by distance between the brackets. The smallest distance, 2, refers to the case where the brackets are adjacent.
  • Figure 5: Test-set perplexity and pseudo-perplexity at checkpoints during training.
  • ...and 2 more figures