Yoon Kim, Carl Denton, Luong Hoang, Alexander M. Rush
Abstract
Attention networks have proven to be an effective approach for embedding
categorical inference within a deep neural network. However, for many tasks we
may want to model richer structural dependencies without abandoning end-to-end
training. In this work, we experiment with incorporating richer structural
distributions, encoded using graphical models, within deep networks. We show
that these structured attention networks are simple extensions of the basic
attention procedure, and that they allow for extending attention beyond the
standard soft-selection approach, such as attending to partial segmentations or
to subtrees. We experiment with two different classes of structured attention
networks: a linear-chain conditional random field and a graph-based parsing
model, and describe how these models can be practically implemented as neural
network layers. Experiments show that this approach is effective for
incorporating structural biases, and structured attention networks outperform
baseline attention models on a variety of synthetic and real tasks: tree
transduction, neural machine translation, question answering, and natural
language inference. We further find that models trained in this way learn
interesting unsupervised hidden representations that generalize simple
attention.