A Transformer with Stack Attention
Jiaoda Li, Jennifer C. White, Mrinmaya Sachan, Ryan Cotterell
TL;DR
This work proposes a differentiable stack-based attention mechanism that can be plugged into transformers to improve their ability to learn deterministic context-free languages. The approach emulates a stack over the input index set and inserts a stack-attention sub-layer within each transformer layer. It improves performance on several context-free tasks (notably reverse strings and stack manipulation) and remains interpretable through stack-top attention maps. The results also reveal limitations: modular arithmetic tasks remain challenging, and the benefit of stack augmentation diminishes as training data grows. This suggests the method provides an inductive bias that is useful in data-scarce settings rather than a universal solution for context-free languages. Overall, the paper presents a modular, interpretable augmentation that extends transformer expressivity for hierarchical structure, and it outlines practical and theoretical limitations and directions for future work.
Abstract
Natural languages are believed to be (mildly) context-sensitive. Despite underpinning remarkably capable large language models, transformers are unable to model many context-free language tasks. In an attempt to address this limitation in the modeling power of transformer-based language models, we propose augmenting them with a differentiable, stack-based attention mechanism. Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model. We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.
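To make the idea of a stack over input indices concrete, the following is a minimal sketch of a hard (non-differentiable) stack recognizing strings of the form w # reverse(w), the kind of reverse-string task mentioned in the TL;DR. It is an illustration only, not the paper's mechanism: the function name `stack_top_trace`, the separator symbol `#`, and the push/no-op/pop schedule are assumptions chosen for the example. The differentiable version described in the paper would replace these discrete choices with learned, soft ones.

```python
def stack_top_trace(tokens, sep="#"):
    """Hard stack over input positions for the language w # reverse(w).

    Returns (accepted, tops), where tops[t] is the input index on top of
    the stack after reading token t, or None if the stack is empty.
    Hypothetical helper for illustration; not taken from the paper.
    """
    stack = []        # holds input positions, not symbols
    tops = []
    seen_sep = False
    accepted = True
    for t, tok in enumerate(tokens):
        if tok == sep and not seen_sep:
            seen_sep = True          # no-op: the stack is left unchanged
        elif not seen_sep:
            stack.append(t)          # push the current position
        else:
            # pop: the token must match the one stored at the top index
            if not stack or tokens[stack[-1]] != tok:
                accepted = False
                break
            stack.pop()
        tops.append(stack[-1] if stack else None)
    # accept only if the separator was seen and every push was matched
    accepted = accepted and seen_sep and not stack
    return accepted, tops


ok, tops = stack_top_trace(list("ab#ba"))
print(ok, tops)
```

The list `tops` plays the role of a hard stack-top attention map: at each step it names the single input position the model would attend to. A soft variant would hold a probability distribution over positions at each step instead of one index.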
