Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns

Brian DuSell; David Chiang

TL;DR

Transformers struggle with hierarchical patterns of arbitrary nesting depth. This paper introduces stack attention, which incorporates differentiable stacks into the transformer attention mechanism. It proposes two variants: a deterministic one based on the superposition stack, and a nondeterministic one based on the differentiable vector PDA (dVPDA), which allows transformers to recognize arbitrary context-free languages (CFLs). Neither variant requires syntactic supervision. Empirically, nondeterministic stack attention improves results on CFL learning tasks and lowers language modeling perplexity under a constrained parameter budget, while machine translation results are mixed. The results suggest that a latent, unsupervised model of syntax may help transformers learn hierarchical structure more data-efficiently.

Abstract

Attention, specifically scaled dot-product attention, has proven effective for natural language, but it does not have a mechanism for handling hierarchical patterns of arbitrary nesting depth, which limits its ability to recognize certain syntactic structures. To address this shortcoming, we propose stack attention: an attention operator that incorporates stacks, inspired by their theoretical connections to context-free languages (CFLs). We show that stack attention is analogous to standard attention, but with a latent model of syntax that requires no syntactic supervision. We propose two variants: one related to deterministic pushdown automata (PDAs) and one based on nondeterministic PDAs, which allows transformers to recognize arbitrary CFLs. We show that transformers with stack attention are very effective at learning CFLs that standard transformers struggle on, achieving strong results on a CFL with theoretically maximal parsing difficulty. We also show that stack attention is more effective at natural language modeling under a constrained parameter budget, and we include results on machine translation.

Paper Structure (29 sections, 20 equations, 4 figures, 6 tables)

Introduction
Related work
Background
Scaled dot-product attention
Differentiable stacks
Superposition stack
Nondeterministic stack
Comparison of differentiable stacks
Method
Context-free languages
Natural language modeling
Machine translation
Conclusion
Details of pushdown automata
Implementation details of the dVPDA
...and 14 more sections

Figures (4)

Figure 1: Conceptual diagram of a stack attention sublayer, unrolled across a portion of time. Dotted arrows indicate linear transformations, and dashed arrows indicate residual connections.
Figure 2: Language modeling results on context-free languages, comparing transformers with and without stack attention, as well as their LSTM counterparts. Left: Cross-entropy difference ($\downarrow$) in nats between model and source distribution on the validation set, as a function of training time. Lines are the best of 10 runs, selected by validation cross-entropy difference. Right: Cross-entropy difference ($\downarrow$) on the test set, binned by string length. The dashed line indicates the longest length in the training set. See \ref{tab:cfl-parameter-count} for model parameter counts. Nondeterministic stack attention (Tf+Nd) outperforms standard attention (Tf) on $w w^R$, $w a^p w^R$, and Hardest CFL; and it achieves the best in-distribution performance on Hardest CFL despite having the fewest parameters.
Figure 3: Visualization of superposition stack attention on a string in $w \texttt{\#} w^R$. As expected, the model learns to push all symbols before #, do nothing when reading #, and pop all symbols after #.
Figure 4: Visualization of superposition stack attention on a string in the Dyck language. As expected, the model learns to push opening brackets and pop when reading closing brackets.
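
The push, no-op, and pop behavior described in the captions for Figures 3 and 4 can be illustrated with a minimal sketch of a superposition-style differentiable stack. This is not the paper's implementation: the function name `superposition_step`, the list-based stack layout, and the one-hot symbol vectors are assumptions made for illustration. The only idea taken as given is that the next stack is a weighted average of the results of pushing, doing nothing, and popping.

```python
def superposition_step(stack, actions, vector):
    """One update of a superposition-style stack (illustrative sketch).

    stack:   list of vectors (lists of floats); index 0 is the top.
    actions: (push, noop, pop) weights, assumed non-negative and summing to 1.
    vector:  the vector that a push would place on top.
    """
    push_w, noop_w, pop_w = actions
    depth = len(stack) + 1  # a push can grow the stack by one element
    zero = [0.0] * len(vector)

    def at(i):
        # Element i of the current stack, or zeros past the bottom.
        return stack[i] if 0 <= i < len(stack) else zero

    new_stack = []
    for i in range(depth):
        pushed = vector if i == 0 else at(i - 1)  # push shifts elements down
        kept = at(i)                              # no-op leaves them in place
        popped = at(i + 1)                        # pop shifts elements up
        new_stack.append([
            push_w * p + noop_w * k + pop_w * q
            for p, k, q in zip(pushed, kept, popped)
        ])
    return new_stack


# Hypothetical one-hot encodings for the symbols a and b.
a, b = [1.0, 0.0], [0.0, 1.0]
PUSH, NOOP, POP = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)

# Read "ab#ba": push before '#', do nothing on '#', pop afterwards.
stack = []
stack = superposition_step(stack, PUSH, a)
stack = superposition_step(stack, PUSH, b)
stack = superposition_step(stack, NOOP, a)  # vector is ignored when push weight is 0
top_after_marker = stack[0]
stack = superposition_step(stack, POP, a)
top_after_one_pop = stack[0]
```

With hard (one-hot) action weights this behaves like an ordinary stack; with fractional weights each element becomes a blend of the three outcomes, which is what makes the update differentiable. In this sketch the list grows by one slot per step regardless of the action, with unused slots holding zeros.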

Theorems & Definitions (1)

Definition 1
