Attention as a Hypernetwork
Simon Schug, Seijin Kobayashi, Yassir Akram, João Sacramento, Razvan Pascanu
TL;DR
This work reframes multi-head attention as a hypernetwork: the attention scores across heads for a given key-query pair form a low-dimensional latent code that configures a value network specific to that pair. It shows that scaling model size and data enables compositional generalization on abstract reasoning tasks and yields a structured latent space that is predictive of the subtasks the network performs. By introducing HYLA, which makes the value network nonlinear and normalizes across heads, the authors demonstrate improved compositional generalization on abstract reasoning tasks, including a symbolic version of Raven's Progressive Matrices (sraven). The results suggest that the hypernetwork mechanism in attention may underpin aspects of in-context learning and compositionality in transformers.
Abstract
Under some circumstances, transformers can generalize to novel problem instances whose constituent parts might have been encountered during training, but whose compositions have not. What mechanisms underlie this capacity for compositional generalization? By reformulating multi-head attention as a hypernetwork, we reveal that a composable, low-dimensional latent code specifies key-query-specific operations. We find empirically that this latent code is predictive of the subtasks the network performs on unseen task compositions, revealing that latent codes acquired during training are reused to solve unseen problem instances. To further examine the hypothesis that the intrinsic hypernetwork of multi-head attention supports compositional generalization, we test whether making the hypernetwork-generated linear value network nonlinear strengthens compositionality. We find that this modification improves compositional generalization on abstract reasoning tasks. In particular, we introduce a symbolic version of the Raven's Progressive Matrices human intelligence test, which gives us precise control over the problem compositions encountered during training and evaluation. We demonstrate on this task how scaling model size and data enables compositional generalization in transformers and gives rise to a functionally structured latent space.
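The reformulation described in the abstract can be illustrated with a small numerical check. The sketch below is not the authors' code. It assumes standard softmax attention without masking, writes the output projection as a sum of per-head blocks, and uses hypothetical names (`standard_mha`, `hypernetwork_mha`, `Wq`, `Wk`, `Wv`, `Wo`) and toy dimensions. It computes multi-head attention in the usual way and again as a sum over keys of a linear value network whose weights are generated from the per-head attention scores of each key-query pair, then compares the two results.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrices A and B, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(X, Wq, Wk):
    """Return a[q][k] for one head: softmax over keys of scaled dot products."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    scale = math.sqrt(len(Q[0]))
    return [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K]) for q in Q]

def standard_mha(X, heads):
    """Usual view: each head mixes its values, head outputs are projected and summed."""
    T, D = len(X), len(X[0])
    out = [[0.0] * D for _ in range(T)]
    for h in heads:
        A = attention_scores(X, h["Wq"], h["Wk"])
        V = [matvec(h["Wv"], x) for x in X]
        for q in range(T):
            mixed = [sum(A[q][k] * V[k][i] for k in range(T)) for i in range(len(V[0]))]
            out[q] = [o + p for o, p in zip(out[q], matvec(h["Wo"], mixed))]
    return out

def hypernetwork_mha(X, heads):
    """Hypernetwork view: the scores across heads for (q, k) generate a D x D value map."""
    T, D = len(X), len(X[0])
    scores = [attention_scores(X, h["Wq"], h["Wk"]) for h in heads]
    basis = [matmul(h["Wo"], h["Wv"]) for h in heads]  # one D x D matrix per head
    out = [[0.0] * D for _ in range(T)]
    for q in range(T):
        for k in range(T):
            latent = [s[q][k] for s in scores]  # latent code with one entry per head
            # Linear hypernetwork: weights are a latent-weighted sum of the per-head bases.
            W_qk = [[sum(z * B[i][j] for z, B in zip(latent, basis)) for j in range(D)]
                    for i in range(D)]
            out[q] = [o + c for o, c in zip(out[q], matvec(W_qk, X[k]))]
    return out

# Toy sizes, chosen only for illustration.
T, D, H, D_HEAD = 5, 4, 3, 2
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

X = rand_matrix(T, D)
heads = [{"Wq": rand_matrix(D_HEAD, D), "Wk": rand_matrix(D_HEAD, D),
          "Wv": rand_matrix(D_HEAD, D), "Wo": rand_matrix(D, D_HEAD)} for _ in range(H)]

out_standard = standard_mha(X, heads)
out_hyper = hypernetwork_mha(X, heads)
max_gap = max(abs(u - v) for ra, rb in zip(out_standard, out_hyper) for u, v in zip(ra, rb))
print("max abs difference:", max_gap)
```

The two functions agree up to floating-point error because the sum over heads and the sum over keys can be exchanged. In this sketch the generated value network is linear; the nonlinear variant studied in the paper is not implemented here.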
