Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns

Brian DuSell, David Chiang

TL;DR

The paper addresses transformers' difficulty with hierarchical patterns by introducing stack attention, which embeds differentiable stacks into the transformer attention mechanism. It proposes two variants: a deterministic superposition stack, and a nondeterministic variant built on a differentiable vector PDA (dVPDA), which lets transformers recognize arbitrary context-free languages without syntactic supervision. Empirically, nondeterministic stack attention improves performance on CFL language modeling tasks and lowers natural language modeling perplexity under a constrained parameter budget, though machine translation results are mixed. The work advances unsupervised syntax modeling in transformers and suggests a path toward more data-efficient modeling of hierarchical structure in language.

Abstract

Attention, specifically scaled dot-product attention, has proven effective for natural language, but it does not have a mechanism for handling hierarchical patterns of arbitrary nesting depth, which limits its ability to recognize certain syntactic structures. To address this shortcoming, we propose stack attention: an attention operator that incorporates stacks, inspired by their theoretical connections to context-free languages (CFLs). We show that stack attention is analogous to standard attention, but with a latent model of syntax that requires no syntactic supervision. We propose two variants: one related to deterministic pushdown automata (PDAs) and one based on nondeterministic PDAs, which allows transformers to recognize arbitrary CFLs. We show that transformers with stack attention are very effective at learning CFLs that standard transformers struggle on, achieving strong results on a CFL with theoretically maximal parsing difficulty. We also show that stack attention is more effective at natural language modeling under a constrained parameter budget, and we include results on machine translation.

Paper Structure (29 sections, 20 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Conceptual diagram of a stack attention sublayer, unrolled across a portion of time. Dotted arrows indicate linear transformations, and dashed arrows indicate residual connections.
  • Figure 2: Language modeling results on context-free languages, comparing transformers with and without stack attention, as well as their LSTM counterparts. Left: Cross-entropy difference ($\downarrow$) in nats between model and source distribution on the validation set, as a function of training time. Lines are the best of 10 runs, selected by validation cross-entropy difference. Right: Cross-entropy difference ($\downarrow$) on the test set, binned by string length. The dashed line indicates the longest length in the training set. See Table \ref{tab:cfl-parameter-count} for model parameter counts. Nondeterministic stack attention (Tf+Nd) outperforms standard attention (Tf) on $w w^R$, $w a^p w^R$, and Hardest CFL; and it achieves the best in-distribution performance on Hardest CFL despite having the fewest parameters.
  • Figure 3: Visualization of superposition stack attention on a string in $w \texttt{\#} w^R$. As expected, the model learns to push all symbols before #, do nothing when reading #, and pop all symbols after #.
  • Figure 4: Visualization of superposition stack attention on a string in the Dyck language. As expected, the model learns to push opening brackets and pop when reading closing brackets.
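
The push and pop behavior described in Figures 3 and 4 can be illustrated with a small sketch of a superposition stack. This is not the authors' implementation: it assumes the common formulation in which each stack element after a step is a weighted mix of what a push, a no-op, and a pop would leave at that depth, and it hard-codes the action weights that a trained model would have to learn. The names `superposition_step`, `run_policy`, and `EMB` are hypothetical.

```python
# Hypothetical one-hot symbol embeddings (an assumption for illustration only).
EMB = {"a": [1.0, 0.0], "b": [0.0, 1.0], "#": [0.0, 0.0]}


def superposition_step(stack, push_vec, a_push, a_noop, a_pop):
    """Apply one soft stack update; stack[0] is the top of the stack.

    a_push, a_noop, a_pop are action weights that should sum to 1.
    Depths beyond the current stack are treated as zero vectors.
    """
    zero = [0.0] * len(push_vec)

    def get(i):
        return stack[i] if 0 <= i < len(stack) else zero

    new_stack = []
    # The stack can grow by at most one element per step.
    for i in range(len(stack) + 1):
        after_push = push_vec if i == 0 else get(i - 1)  # push shifts elements down
        after_noop = get(i)                               # no-op leaves them in place
        after_pop = get(i + 1)                            # pop shifts elements up
        new_stack.append([
            a_push * p + a_noop * n + a_pop * q
            for p, n, q in zip(after_push, after_noop, after_pop)
        ])
    return new_stack


def run_policy(string):
    """Hard-coded policy for w#w^R: push before '#', no-op on '#', pop after.

    Returns the top-of-stack vector after each input symbol.
    """
    stack, tops, seen_marker = [], [], False
    for sym in string:
        if sym == "#":
            seen_marker = True
            actions = (0.0, 1.0, 0.0)  # no-op
        elif seen_marker:
            actions = (0.0, 0.0, 1.0)  # pop
        else:
            actions = (1.0, 0.0, 0.0)  # push
        stack = superposition_step(stack, EMB[sym], *actions)
        tops.append(stack[0])
    return tops


for sym, top in zip("ab#ba", run_policy("ab#ba")):
    print(sym, top)
```

With hard (0 or 1) action weights the structure behaves like an ordinary stack, so after `#` the top element always matches the next symbol of $w^R$ until the stack empties. With fractional weights the top becomes a blend of the candidate stacks, which is what makes the update differentiable.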

Theorems & Definitions (1)

  • Definition 1