Are queries and keys always relevant? A case study on Transformer wave functions
Riccardo Rende, Luciano Loris Viteritti
TL;DR
This work investigates whether the canonical queries-keys mechanism of Transformer attention is essential when parametrizing quantum many-body ground states with a Vision Transformer–based neural network state. Using variational Monte Carlo with Stochastic Reconfiguration, the authors compare standard attention (T5/Decoupled) with a Factored, input-independent attention on the two-dimensional $J_1$-$J_2$ Heisenberg model, finding essentially identical accuracy at reduced cost for the latter. They show that the attention weights become input-independent at convergence and provide analytical arguments, including an exact mapping for the Shastry-Sutherland ground state, that explain why large systems favor positional-only connections. The results suggest that, in systems with decaying correlations, queries and keys may be unnecessary. This has practical implications for scaling Transformer-based quantum states and may inform attention design in NLP and vision tasks with long sequences.
Abstract
The dot-product attention mechanism, originally designed for natural language processing tasks, is a cornerstone of modern Transformers. It captures semantic relationships between word pairs in sentences by computing a similarity overlap between queries and keys. In this work, we explore the suitability of Transformers, focusing on their attention mechanisms, for parametrizing variational wave functions that approximate ground states of quantum many-body spin Hamiltonians. Specifically, we perform numerical simulations on the two-dimensional $J_1$-$J_2$ Heisenberg model, a common benchmark in the field of quantum many-body systems on a lattice. By comparing the performance of standard attention mechanisms with a simplified version that excludes queries and keys and relies solely on positions, we achieve competitive results while reducing computational cost and parameter usage. Furthermore, by analyzing the attention maps generated by standard attention mechanisms, we show that the attention weights become effectively input-independent at the end of the optimization. We support the numerical results with analytical calculations, which provide physical insight into why queries and keys should, in principle, be omitted from the attention mechanism when studying large systems.
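To make the comparison concrete, the following minimal sketch contrasts the two kinds of attention weights on a toy input. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the $3 \times 2$ toy inputs, the identity projection matrices, and the unconstrained table `alpha` are all hypothetical, and the sketch does not reproduce how the positional weights are parametrized or normalized in the actual model.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def standard_attention_weights(x, w_q, w_k):
    """Input-dependent weights: softmax(Q K^T / sqrt(d)), with Q = x w_q, K = x w_k."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    d = len(q[0])
    scores = matmul(q, transpose(k))
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]

def factored_attention_weights(alpha):
    """Input-independent weights: a learned table indexed only by positions (i, j).
    Assumption: a free n x n table; no queries or keys enter the computation."""
    return alpha

def attend(weights, x, w_v):
    """Mix the value vectors x w_v with the given attention weights."""
    return matmul(weights, matmul(x, w_v))

# Hypothetical toy data: 3 tokens (e.g. lattice patches) with 2 features each.
identity = [[1.0, 0.0], [0.0, 1.0]]
x1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x2 = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = [[0.5, 0.3, 0.2], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]

a1 = standard_attention_weights(x1, identity, identity)
a2 = standard_attention_weights(x2, identity, identity)
out_standard = attend(a1, x1, identity)
out_factored = attend(factored_attention_weights(alpha), x1, identity)
```

In this sketch the standard weights `a1` and `a2` differ because they are recomputed from each input, whereas the positional table `alpha` is shared by every input and requires no query or key projections.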
