Dependency Transformer Grammars: Integrating Dependency Structures into Transformer Language Models

Yida Zhao; Chao Lou; Kewei Tu

TL;DR

DTGs introduce a dependency-based inductive bias into Transformer language models by modeling the joint distribution $p(\mathbf{x}, \mathbf{y})$ over sentences and dependency trees and approximating $p(\mathbf{x})$ via a proposal set $\mathbf{Y'}$ of trees, i.e., $p(\mathbf{x}) \approx \sum_{\mathbf{y} \in \mathbf{Y'}} p(\mathbf{x}, \mathbf{y})$. They implement this bias by autoregressively generating dependency transition sequences with constrained attention: STACK attention tracks the current stack for GEN/arc2 steps, and COMPOSE attention updates head representations by focusing on the top two stack items, with transitions duplicated to separate generation from composition. The model uses Transformer-XL style relative positional encoding tied to stack depth and augments arc representations by summing the arc-type embedding with the head token embedding. Experimental results on the BLLIP-LG corpus show DTGs achieve perplexities comparable to Transformer baselines while delivering stronger syntactic generalization on BLiMP and SG tasks, and parse reranking experiments indicate the learned dependencies align with external parsers. Overall, the work demonstrates that explicit dependency structures can guide Transformer LMs to better generalize syntactically, suggesting avenues for integrating dependency information with semantic capabilities and broader dependency representations.
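The approximation $p(\mathbf{x}) \approx \sum_{\mathbf{y} \in \mathbf{Y'}} p(\mathbf{x}, \mathbf{y})$ is usually computed in log space. The sketch below is illustrative only and is not taken from the released code: the function names and the example scores are hypothetical, and it assumes each proposal tree in $\mathbf{Y'}$ is distinct and has already been scored with a joint log-probability.

```python
import math


def log_marginal_from_proposals(joint_log_probs):
    """Return log sum_y exp(log p(x, y)) over the proposal trees.

    joint_log_probs: list of log p(x, y), one per distinct tree in Y'
    (hypothetical input; duplicates would be double-counted).
    """
    if not joint_log_probs:
        raise ValueError("need at least one proposal tree")
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(joint_log_probs)
    return m + math.log(sum(math.exp(lp - m) for lp in joint_log_probs))


def perplexity(total_log_prob, num_tokens):
    """Per-token perplexity from a summed natural-log probability."""
    return math.exp(-total_log_prob / num_tokens)


# Hypothetical scores for three proposal trees of one sentence.
proposal_scores = [-42.0, -43.5, -47.0]
log_px = log_marginal_from_proposals(proposal_scores)
ppl = perplexity(log_px, num_tokens=10)
```

Because the sum runs over a subset of all trees, the result is a lower bound on the true $\log p(\mathbf{x})$, so a perplexity computed from it is an upper bound on the model's true perplexity.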

Abstract

Syntactic Transformer language models aim to achieve better generalization by modeling syntax trees and sentences simultaneously. While prior work has focused on adding constituency-based structures to Transformers, we introduce Dependency Transformer Grammars (DTGs), a new class of Transformer language models with an explicit dependency-based inductive bias. DTGs simulate dependency transition systems with constrained attention patterns by modifying attention masks, incorporate stack information through relative positional encoding, and augment dependency arc representations with a combination of token embeddings and operation embeddings. When trained on a dataset of sentences annotated with dependency trees, DTGs achieve better generalization while maintaining perplexity comparable to Transformer language model baselines. DTGs also outperform recent constituency-based models, showing that dependency can better guide Transformer language models. Our code is released at https://github.com/zhaoyd1/Dep_Transformer_Grammars.
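To make the notion of a dependency transition system concrete, here is a minimal sketch that replays an arc-standard action sequence with an explicit stack. It is an assumption-laden illustration, not the authors' implementation: the action names `GEN`, `LEFT-ARC`, and `RIGHT-ARC`, the function `run_arc_standard`, and the example sentence are all hypothetical, and the arc directions follow the common arc-standard convention, which may differ in detail from the paper's.

```python
def run_arc_standard(tokens, actions):
    """Replay an arc-standard action sequence.

    Returns (heads, root), where heads maps each dependent's token index
    to its head's token index and root is the index left on the stack.
    """
    stack = []       # token indices of partially built subtrees
    heads = {}       # dependent index -> head index
    next_token = 0   # index of the next token to generate

    for action in actions:
        if action == "GEN":
            if next_token >= len(tokens):
                raise ValueError("GEN with no tokens left")
            stack.append(next_token)
            next_token += 1
        elif action in ("LEFT-ARC", "RIGHT-ARC"):
            if len(stack) < 2:
                raise ValueError(f"{action} needs two stack items")
            top = stack.pop()
            second = stack.pop()
            if action == "LEFT-ARC":
                # Top item becomes the head of the second item.
                heads[second] = top
                stack.append(top)
            else:
                # Second item becomes the head of the top item.
                heads[top] = second
                stack.append(second)
        else:
            raise ValueError(f"unknown action: {action}")

    if next_token != len(tokens) or len(stack) != 1:
        raise ValueError("action sequence does not build a complete tree")
    return heads, stack[0]


# Hypothetical example: "the" <- "dog" <- "barks".
tokens = ["the", "dog", "barks"]
actions = ["GEN", "GEN", "LEFT-ARC", "GEN", "LEFT-ARC"]
heads, root = run_arc_standard(tokens, actions)
```

Under this convention a sentence of n tokens takes n `GEN` actions and n - 1 arc actions. Both arc actions only touch the top two stack items, which is the locality that a constrained attention pattern can exploit.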

Paper Structure (32 sections, 5 figures, 6 tables, 1 algorithm)

Introduction
Preliminaries: Transition-based Dependency Parsing
Model
Arc-Standard via Attention Mask
Relative Positional Encoding
Arc Representation
Other Transition Systems via Attention Mask
Constraints on Inference
Experiments
Dataset and Preprocessing
Training Details
Sentence-Level Language Modeling
Setup
Results
Syntactic Generalization
...and 17 more sections

Figures (5)

Figure 1: An example sentence with its dependency tree and transition sequence. Numbers in blue and red are indices of tokens and arcs respectively.
Figure 2: Transition sequence and attention masks of an example sentence
Figure 3: Scores on the six circuits of the SG test suites.
Figure 4: Arc-eager processing of an example sentence
Figure 5: Arc-swift processing of an example sentence
