Dependency Transformer Grammars: Integrating Dependency Structures into Transformer Language Models
Yida Zhao, Chao Lou, Kewei Tu
TL;DR
DTGs introduce a dependency-based inductive bias into Transformer language models by modeling the joint distribution $p(\mathbf{x}, \mathbf{y})$ over sentences and dependency trees and approximating $p(\mathbf{x})$ via a proposal set $\mathbf{Y'}$ of trees, i.e., $p(\mathbf{x}) \approx \sum_{\mathbf{y} \in \mathbf{Y'}} p(\mathbf{x}, \mathbf{y})$. They implement this bias by autoregressively generating dependency transition sequences with constrained attention: STACK attention tracks the current stack for GEN/arc2 steps, and COMPOSE attention updates head representations by focusing on the top two stack items, with transitions duplicated to separate generation from composition. The model uses Transformer-XL style relative positional encoding tied to stack depth and augments arc representations by summing the arc-type embedding with the head token embedding. Experimental results on the BLLIP-LG corpus show DTGs achieve perplexities comparable to Transformer baselines while delivering stronger syntactic generalization on BLiMP and SG tasks, and parse reranking experiments indicate the learned dependencies align with external parsers. Overall, the work demonstrates that explicit dependency structures can guide Transformer LMs to better generalize syntactically, suggesting avenues for integrating dependency information with semantic capabilities and broader dependency representations.
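The approximation $p(\mathbf{x}) \approx \sum_{\mathbf{y} \in \mathbf{Y'}} p(\mathbf{x}, \mathbf{y})$ is usually evaluated in log space, since joint probabilities of whole sentences underflow quickly. The following is a minimal sketch, not the authors' code: the function name `approx_log_marginal` and the example numbers are hypothetical, and it assumes the joint log-probabilities $\log p(\mathbf{x}, \mathbf{y})$ for each distinct proposal tree have already been computed by the model.

```python
import math


def approx_log_marginal(joint_log_probs):
    """Approximate log p(x) as log sum_{y in Y'} p(x, y).

    joint_log_probs: one log p(x, y) per distinct proposal tree y in Y'.
    Uses the log-sum-exp trick so that very negative log-probabilities
    do not underflow to zero when exponentiated.
    """
    if not joint_log_probs:
        raise ValueError("the proposal set Y' must contain at least one tree")
    m = max(joint_log_probs)
    return m + math.log(sum(math.exp(lp - m) for lp in joint_log_probs))


# Hypothetical joint probabilities for two proposal trees of one sentence.
log_px = approx_log_marginal([math.log(0.1), math.log(0.2)])
```

Because $\mathbf{Y'}$ covers only a subset of all possible trees, the value computed this way is a lower bound on the true $\log p(\mathbf{x})$, so a perplexity derived from it is an upper bound on the model's true perplexity.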
Abstract
Syntactic Transformer language models aim to achieve better generalization by simultaneously modeling syntax trees and sentences. While prior work has focused on adding constituency-based structures to Transformers, we introduce Dependency Transformer Grammars (DTGs), a new class of Transformer language models with an explicit dependency-based inductive bias. DTGs simulate dependency transition systems with constrained attention patterns by modifying attention masks, incorporate stack information through relative positional encoding, and augment dependency arc representations with a combination of token embeddings and operation embeddings. When trained on a dataset of sentences annotated with dependency trees, DTGs achieve better generalization while maintaining perplexity comparable to Transformer language model baselines. DTGs also outperform recent constituency-based models, showing that dependency structures can better guide Transformer language models. Our code is released at https://github.com/zhaoyd1/Dep_Transformer_Grammars.
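To make the idea of simulating a transition system through attention masks concrete, the sketch below tracks a stack over a transition sequence and records which positions each step may attend to. It is an illustrative simplification, not the released implementation. It assumes an arc-standard-style system in which each arc transition is duplicated into a composition step and a generation step. The action names (`GEN`, `LA1`, `LA2`, `RA1`, `RA2`), the function `attention_sets`, and the choice to let every position attend to itself are all assumptions made for the example.

```python
def attention_sets(transitions):
    """For each step, list the positions it may attend to.

    GEN       : push the new token, then attend to the whole stack (STACK).
    LA1 / RA1 : attend to the top two stack items plus itself (COMPOSE),
                then replace those two items with this position, which
                stands for the updated head representation.
    LA2 / RA2 : attend to the current stack plus itself (STACK); the stack
                is left unchanged.
    """
    stack = []    # positions whose representations are currently on the stack
    visible = []  # visible[i] = positions step i may attend to
    for pos, action in enumerate(transitions):
        if action == "GEN":
            stack.append(pos)
            visible.append(list(stack))
        elif action in ("LA1", "RA1"):
            if len(stack) < 2:
                raise ValueError(f"step {pos}: an arc needs two stack items")
            top = stack.pop()
            second = stack.pop()
            visible.append([second, top, pos])
            stack.append(pos)
        elif action in ("LA2", "RA2"):
            visible.append(list(stack) + [pos])
        else:
            raise ValueError(f"step {pos}: unknown action {action!r}")
    return visible


# "the dog barks": attach the <- dog, then dog <- barks (left arcs only).
sets = attention_sets(["GEN", "GEN", "LA1", "LA2", "GEN", "LA1", "LA2"])
# sets == [[0], [0, 1], [0, 1, 2], [2, 3], [2, 4], [2, 4, 5], [5, 6]]
```

Each list can be turned into one row of a boolean attention mask. Note how the second `GEN` step for "barks" sees only position 2 (the composed representation of "the dog") and itself, rather than every earlier token, which is where the syntactic inductive bias comes from.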
