QV May Be Enough: Toward the Essence of Attention in LLMs

Zhang Edward

Abstract

Starting from first principles and a linguistic perspective centered on part-of-speech (POS) and syntactic analysis, this paper derives the underlying essence of the Query-Key-Value (QKV) mechanism in the Transformer architecture. On this theoretical foundation, we provide a unified explanation for the efficacy of contemporary architectures, including MQA, GQA, and MLA, and identify their inherent trade-offs and potential directions for optimization. We introduce the QV paradigm and provide empirical evidence for its validity. Building on it, we propose the QV-Ka optimization scheme and validate it experimentally. The interpretable theoretical analysis of the QKV mechanism presented here provides a foundation for the future evolution of large language model architectures.

Paper Structure (12 sections, 8 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Matching of Different Tokens
  • Figure 2: QKV Paradigm
  • Figure 3: QV Paradigm
  • Figure 4: Diffusion of Deep-Matching
  • Figure 5: QV Mode vs QKV Mode
  • ...and 2 more figures