The geometry of BERT

Matteo Bonino, Giorgia Ghione, Giansalvo Cirrincione

TL;DR

The paper addresses interpretability of BERT by linking how local attention patterns and the global information stream drive decisions. It proposes two novel concepts, directionality of subspace selection and a cone index $I_{cone}([l]) = \| \sum_{x \in [l]} x \|$, and develops a taxonomy of self-attention patterns (gates) to explain how tokens are searched and integrated. It uses a SARS-CoV-2 RNA variant classification case study to ground the analysis in a real task and shows how per-head patterns, concatenation operations, and the MLP influence the recall process. The results point to architectural implications for Transformer models and offer a framework for future training-time analysis.
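As a rough illustration of the cone index formula above, the following sketch computes $\| \sum_{x \in [l]} x \|$ for a small set of vectors. It assumes that $[l]$ is a finite set of vectors stored as the rows of an array, that the norm is Euclidean, and that no normalization is applied; the function name `cone_index` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def cone_index(vectors):
    """Euclidean norm of the sum of the rows of `vectors`.

    Assumption: [l] is a finite set of vectors, one per row.
    """
    vectors = np.asarray(vectors, dtype=float)
    return float(np.linalg.norm(vectors.sum(axis=0)))


# Toy data (hypothetical): three vectors pointing the same way,
# and three vectors that partly cancel each other out.
aligned = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
spread = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])

aligned_index = cone_index(aligned)
spread_index = cone_index(spread)
print(aligned_index, spread_index)
```

Under these assumptions, vectors concentrated in a narrow cone give a large value, while vectors that point in opposing directions cancel in the sum and give a smaller one.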

Abstract

Transformer neural networks, particularly Bidirectional Encoder Representations from Transformers (BERT), have shown remarkable performance across various tasks such as classification, text summarization, and question answering. However, their internal mechanisms remain mathematically obscure, highlighting the need for greater explainability and interpretability. To this end, this paper investigates the internal mechanisms of BERT and proposes a novel theoretical perspective on its attention mechanism. The analysis encompasses both local and global network behavior. At the local level, the paper presents the concept of directionality of subspace selection and a comprehensive study of the patterns emerging from the self-attention matrix. At the global level, it explores the semantic content of the information stream through data distribution analysis and global statistical measures, including the novel concept of the cone index. A case study on the classification of SARS-CoV-2 variants from RNA sequences, which achieved very high accuracy, is used to observe these concepts in an application. The insights gained from this analysis contribute to a deeper understanding of BERT's classification process, offering potential avenues for future architectural improvements in Transformer models and for further analysis of the training process.

Paper Structure

This paper contains 41 sections, 4 theorems, 48 equations, 13 figures, 1 table.

Key Result

Theorem 4.2.1

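(bernstein2018) Let $\mathscr{C}_1$ and $\mathscr{C}_2$ be the column spaces of two matrices $A$ and $B$, and let $\mathscr{R}_1$ and $\mathscr{R}_2$ be the corresponding row spaces. Then, if $A, B \in \mathbb{R}^{m \times n}$ (or $\mathbb{C}^{m \times n}$), the rank satisfies the subadditivity property $\operatorname{rank}(A + B) \le \operatorname{rank}(A) + \operatorname{rank}(B)$, where equality holds if and only if $\mathscr{C}_1 \cap \mathscr{C}_2 = \{0\}$ and $\mathscr{R}_1 \cap \mathscr{R}_2 = \{0\}$.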
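The rank subadditivity in Theorem 4.2.1, in its standard form $\operatorname{rank}(A + B) \le \operatorname{rank}(A) + \operatorname{rank}(B)$, can be checked numerically on small matrices. The sketch below is only an illustration: the matrices `A`, `B`, and `C` are hypothetical examples. One pair has trivially intersecting column spaces and row spaces, the standard condition for equality. In the other pair the two matrices share a column space, so the inequality is strict.

```python
import numpy as np


def rank(M):
    """Numerical rank, computed by NumPy from the singular values."""
    return int(np.linalg.matrix_rank(M))


# Hypothetical 2x2 examples.
A = np.array([[1.0, 0.0], [0.0, 0.0]])  # column space span(e1), row space span(e1)
B = np.array([[0.0, 0.0], [0.0, 1.0]])  # column space span(e2), row space span(e2)
C = np.array([[0.0, 1.0], [0.0, 0.0]])  # column space span(e1), row space span(e2)

# A and B: column spaces and row spaces meet only in {0}, so equality holds.
equality_case = (rank(A + B), rank(A) + rank(B))

# A and C: both column spaces equal span(e1), so the inequality is strict.
strict_case = (rank(A + C), rank(A) + rank(C))

print(equality_case, strict_case)
```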

Figures (13)

  • Figure 1: BERT architecture introduced in DevlinBERT.
  • Figure 2: Histograms of the norms of the row-vectors $Q_i, K_i$ and $V_i$ of the matrices $Q,K$ and $V$, respectively, of a correctly classified sequence in head 4, layer 1.
  • Figure 3: Attention map of head 9, layer 4 of a correctly classified sequence, showing a directional gate.
  • Figure 4: Attention map of head 4, layer 2 of a correctly classified sequence, showing an open gate.
  • Figure 5: Attention map of head 10, layer 7 of a correctly classified sequence, showing forward contextual gates and backward contextual gates.
  • ...and 8 more figures
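
The quantities plotted in Figures 2 to 5 can be sketched for a single head as follows. This is a minimal sketch that assumes the standard scaled dot-product attention $\mathrm{softmax}(QK^\top / \sqrt{d_k})$. The random matrices stand in for the learned projections of a real sequence, and the names `attention_map` and `row_norms` are hypothetical. It does not reproduce the paper's gate taxonomy.

```python
import numpy as np


def attention_map(Q, K):
    """Row-wise softmax of the scaled dot products Q K^T / sqrt(d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Subtract the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def row_norms(M):
    """Euclidean norm of each row vector M_i."""
    return np.linalg.norm(M, axis=1)


# Random stand-ins (assumption): 6 tokens, head dimension 8.
rng = np.random.default_rng(0)
n_tokens, d_k = 6, 8
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_k))

attn = attention_map(Q, K)  # one row per query token
norms = {name: row_norms(M) for name, M in (("Q", Q), ("K", K), ("V", V))}
```

Each row of `attn` is a probability distribution over the tokens attended to by one query token. A histogram of each array in `norms` would correspond to the kind of plot described in Figure 2.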

Theorems & Definitions (8)

  • Theorem 4.2.1
  • Proposition 4.2.2
  • Definition 6.3.1
  • Remark 6.3.2
  • Proposition 6.3.3
  • Definition 6.3.4
  • Remark 6.3.5
  • Proposition 6.3.6