Is logical analysis performed by transformers taking place in self-attention or in the fully connected part?

Evgeniy Shin; Heinrich Matzinger

TL;DR

This work asks whether logical reasoning in transformers occurs inside self-attention or in the fully connected (FC) layers. It constructs and analyzes one-level transformers tasked with predicting the category-pair function $Y_i = q^{true}(X_{i-1}, X_i)$ from token-category inputs. It presents three hand-programmed architectures that either bring adjacent tokens together via attention or extract category-pair functionals via bilinear attention, and it proves a formal equivalence between two of these pathways. The paper also investigates unwanted zeros of the gradient and shows that Softmax or rescaling mitigates them, so that learning can realize the intended logic. Empirical results on synthetic data show that multiple pathways are viable, with the second and third designs often converging to effective solutions, which highlights the transformer's architectural flexibility in performing logical analysis.

Abstract

The transformer architecture applies self-attention to tokens represented as vectors, followed by a fully connected (neural network) layer. These two parts can be stacked many times. Traditionally, self-attention is seen as a mechanism for aggregating information before logical operations are performed by the fully connected layer. In this paper, we show that, quite counter-intuitively, the logical analysis can also be performed within self-attention. To this end, we implement a handcrafted single-level encoder layer that performs the logical analysis within self-attention. We then study the scenario in which a one-level transformer model is trained by gradient descent, and investigate whether the model uses the fully connected layer or the self-attention mechanism for logical analysis when it has the choice. Since gradient descent can get stuck at undesired zeros of the gradient, we explicitly calculate these unwanted zeros and find ways to avoid them. We do all this in the context of predicting grammatical category pairs of adjacent tokens in a text. We believe that our findings have broader implications for understanding the potential logical operations performed by self-attention.
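To make the task concrete, the following is a minimal NumPy sketch of the category-pair target $Y_i = q^{true}(X_{i-1}, X_i)$ and of the general fact that a bilinear form on one-hot category vectors can reproduce any such pair function. It is an illustration under stated assumptions, not the authors' construction: the one-hot encoding, the random lookup table `q_true`, and the values of `num_categories` and `seq_len` are all hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

num_categories = 4   # assumed number of token categories
seq_len = 10         # assumed sequence length

# Hypothetical ground-truth pair function stored as a lookup table:
# q_true[a, b] is the target for an adjacent pair with categories (a, b).
q_true = rng.normal(size=(num_categories, num_categories))

# Random category sequence X_0, ..., X_{n-1}.
X = rng.integers(0, num_categories, size=seq_len)

# Targets Y_i = q_true(X_{i-1}, X_i) for i = 1, ..., n-1.
Y = q_true[X[:-1], X[1:]]

# One-hot encoding of each token's category (an assumption of this sketch).
E = np.eye(num_categories)[X]          # shape (seq_len, num_categories)

# Bilinear score between positions j and i: E[j] @ W @ E[i].
# With W = q_true this equals q_true[X_j, X_i] for every pair (j, i).
scores = E @ q_true @ E.T              # shape (seq_len, seq_len)

# Read off the entries for adjacent positions (j = i - 1).
idx = np.arange(1, seq_len)
Y_bilinear = scores[idx - 1, idx]

print(np.allclose(Y, Y_bilinear))
```

The point of the sketch is only that a query-key style bilinear product has enough capacity to encode a table over category pairs; selecting the adjacent position, which here is done by explicit indexing, is the part that positional information has to supply in an actual attention layer.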

Paper Structure (15 sections, 215 equations, 3 figures)


Introduction
Three different hand programmed solutions
First approach: bringing the information together with self-attention
Second approach: using self-attention for identifying category pairs
Third approach
Equivalence between third and second solution
Zeros for gradient descent
The case where self-attention acts only on the positional encoding
Solving the problem with Softmax, case when self-attention acts only on positional encoding
Gradient zero for Self-attention not having access to positional encoding
Proving that $\sum_{j\neq i}v_{ij}=0$ is invalid
Learning q,k,v,C,B
Experiments
Results
Conclusion

Figures (3)

Figure 1: Comparison of $q^{true}$ with the learned attention weights for the second solution
Figure 2: Log-MSE for the combined model in four flavors
Figure 3: Four flavors of the combined model
