QAMA: Scalable Quantum Annealing Multi-Head Attention Operator for Deep Learning
Peng Du, Jinjing Shi, Wenxuan Wang, Yin Ma, Kai Wen, Xuelong Li
TL;DR
QAMA reformulates multi-head attention as a Hamiltonian energy minimization problem solved by quantum annealing, enabling sparse, scalable attention that is compatible with coherent Ising machine (CIM) hardware. It introduces a three-term QUBO/Ising representation that encodes query-key interactions, value importance, and head sparsity, along with a gradient-propagation approach for non-differentiable layers. Empirically, it achieves accuracy within $2.7$ points of standard attention on NLP and CV benchmarks while requiring only $O(H\,N)$ qubits for $H$ heads and sequence length $N$. Hardware experiments on a CIM show microsecond-level inference and high fidelity relative to simulations, with minimal loss in task performance, supporting the practical viability of quantum-augmented attention. Overall, QAMA offers a principled, hardware-friendly pathway to scalable attention in deep learning.
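For readers less familiar with the two formulations named above, the standard change of variables between a QUBO and an Ising energy is worked out here. The symbols $q_i$, $Q_{ij}$, $h_i$, $J_{ij}$ and $C$ are generic placeholders and are not taken from the paper; the paper's specific three-term Hamiltonian is not reproduced.

$$
E(x) = \sum_i q_i x_i + \sum_{i<j} Q_{ij}\, x_i x_j, \qquad x_i \in \{0,1\},
$$

$$
x_i = \frac{1 + s_i}{2},\quad s_i \in \{-1,+1\}
\;\;\Longrightarrow\;\;
E(s) = \sum_{i<j} J_{ij}\, s_i s_j + \sum_i h_i s_i + C,
$$

$$
J_{ij} = \frac{Q_{ij}}{4}, \qquad
h_i = \frac{q_i}{2} + \frac{1}{4}\sum_{j \neq i} Q_{ij}, \qquad
C = \frac{1}{2}\sum_i q_i + \frac{1}{4}\sum_{i<j} Q_{ij},
$$

where $Q_{ij} = Q_{ji}$ is understood in the sum defining $h_i$. Any binary quadratic energy can therefore be handed to an Ising-type solver by translating its coefficients in this way.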
Abstract
Attention mechanisms underpin modern deep learning, but their quadratic time and space complexity limits scalability to long sequences. To address this, Quantum Annealing Multi-Head Attention (QAMA) is proposed: a novel drop-in operator that reformulates attention as an energy-based Hamiltonian optimization problem. In this framework, token interactions are encoded into binary quadratic terms, and quantum annealing is employed to search for low-energy configurations that correspond to effective attention patterns. Unlike classical sparse or approximate attention methods that rely on hand-crafted heuristics, QAMA allows sparsity structures to emerge naturally from the optimization process. Theoretically, computational complexity is analysed through single-spin-flip dynamics, providing time-to-solution bounds that depend on the spectral properties of the annealing Hamiltonian. Empirically, evaluation on both natural language and vision benchmarks shows that accuracy deviates by at most 2.7 points from standard multi-head attention across tasks, while the number of qubits required grows only linearly with sequence length. Visualizations further reveal that the Hamiltonian penalty terms induce meaningful and interpretable sparsity across heads. Finally, deployment on a coherent Ising machine validates the feasibility of running QAMA on real quantum hardware, showing tangible inference-time reductions compared with classical implementations. These results highlight QAMA as a pioneering and scalable step toward integrating quantum optimization devices into deep neural architectures, providing a seamlessly integrable and hardware-compatible alternative to conventional attention mechanisms. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
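The idea of encoding token selection into binary quadratic terms and searching for low-energy configurations with single-spin flips can be illustrated with a small classical sketch. Everything here is an assumption made for illustration: the energy in `toy_head_qubo` (an importance reward plus a penalty pulling the number of kept tokens toward `k`) is a stand-in and not the paper's Hamiltonian, classical simulated annealing replaces the quantum annealer or CIM, and all function and parameter names are hypothetical. It uses one bit per token for a single head, in keeping with a qubit count linear in sequence length.

```python
import math
import random


def toy_head_qubo(importance, k, penalty):
    """Hypothetical single-head QUBO (not the paper's): minimise
    -sum_j importance[j] * x_j + penalty * (sum_j x_j - k)^2, constant dropped."""
    n = len(importance)
    # Symmetric off-diagonals of `penalty` give 2 * penalty * x_j * x_l per pair.
    Q = [[penalty] * n for _ in range(n)]
    for j in range(n):
        # x_j^2 = x_j, so linear terms sit on the diagonal.
        Q[j][j] = penalty * (1 - 2 * k) - importance[j]
    return Q


def qubo_energy(Q, x):
    """Energy x^T Q x of a binary vector x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def flip_delta(Q, x, i):
    """Energy change caused by flipping bit i (Q must be symmetric)."""
    d = 1 - 2 * x[i]  # +1 for a 0->1 flip, -1 for a 1->0 flip
    neighbours = sum(Q[i][j] * x[j] for j in range(len(x)) if j != i)
    return d * (Q[i][i] + 2.0 * neighbours)


def anneal(Q, n_sweeps=200, t_start=2.0, t_end=0.01, seed=0):
    """Classical simulated annealing with single-bit-flip Metropolis moves."""
    rng = random.Random(seed)
    n = len(Q)
    x = [rng.randint(0, 1) for _ in range(n)]
    energy = qubo_energy(Q, x)
    best_x, best_energy = list(x), energy
    for sweep in range(n_sweeps):
        # Geometric cooling schedule from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (sweep / max(n_sweeps - 1, 1))
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            delta = flip_delta(Q, x, i)
            # Always accept downhill moves; accept uphill moves with Boltzmann probability.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                x[i] = 1 - x[i]
                energy += delta
                if energy < best_energy:
                    best_x, best_energy = list(x), energy
    return best_x, best_energy


if __name__ == "__main__":
    importance = [0.9, 0.1, 0.4, 0.8, 0.2, 0.7, 0.3, 0.05]  # made-up scores
    Q = toy_head_qubo(importance, k=3, penalty=2.0)
    mask, energy = anneal(Q)
    kept = [j for j, bit in enumerate(mask) if bit]
    print("kept tokens:", kept, "energy:", energy)
```

In this toy energy, choosing `penalty` larger than every importance score makes the global minimum keep exactly `k` tokens, so the sparsity level is set by the penalty term rather than by a hard top-k rule. The annealer is stochastic and may return a different low-energy mask for other seeds or schedules.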
