
Is logical analysis performed by transformers taking place in self-attention or in the fully connected part?

Evgeniy Shin, Heinrich Matzinger

TL;DR

This work examines whether logical reasoning in transformers occurs inside self-attention or in the fully connected (FC) layers by constructing and analyzing one-level transformers tasked with predicting the category-pair function $Y_i = q^{true}(X_{i-1}, X_i)$ from token-category inputs. It presents three hand-programmed architectures that either accumulate adjacent tokens via attention or extract category-pair functionals via bilinear attention, and it proves a formal equivalence between two of these pathways. The paper also investigates points where the gradient vanishes and shows that Softmax or rescaling mitigates such zeros, so that learning can realize the intended logic. Empirical results on synthetic data show that multiple pathways are viable, with the second and third designs often converging to effective solutions, highlighting the transformer’s architectural flexibility in performing logical analysis.

Abstract

Transformer architectures apply self-attention to tokens represented as vectors, followed by a fully connected (neural network) layer. These two parts can be stacked many times. Traditionally, self-attention is seen as a mechanism for aggregating information before logical operations are performed by the fully connected layer. In this paper, we show that, quite counter-intuitively, the logical analysis can also be performed within self-attention. To this end, we implement a handcrafted single-level encoder layer that performs the logical analysis within self-attention. We then study the scenario in which a one-level transformer model learns on its own via gradient descent. We investigate whether the model uses the fully connected layers or the self-attention mechanism for logical analysis when it has the choice. Because gradient descent can become stuck at undesired zeros of the gradient, we explicitly calculate these unwanted zeros and find ways to avoid them. We do all this in the context of predicting grammatical category pairs of adjacent tokens in a text. We believe that our findings have broader implications for understanding the potential logical operations performed by self-attention.
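
To make the idea of a bilinear form acting on adjacent token categories concrete, here is a minimal sketch. It is not the paper's construction: the number of categories, the target function `q_true`, and all helper names are hypothetical choices made for illustration. It relies only on the elementary fact that, for one-hot vectors $e_a$ and $e_b$, the bilinear form $e_a^\top W e_b$ equals the entry $W_{ab}$, so a single bilinear score can reproduce any function of a category pair.

```python
# Hypothetical toy setup (an assumption, not taken from the paper).
NUM_CATEGORIES = 3


def q_true(a, b):
    """Arbitrary illustrative category-pair function."""
    return (2 * a + b) % 4


def one_hot(k, n=NUM_CATEGORIES):
    """Encode category k as a one-hot vector of length n."""
    return [1.0 if j == k else 0.0 for j in range(n)]


# Hand-set weights: entry (a, b) stores the value the pair (a, b) should yield.
W = [[float(q_true(a, b)) for b in range(NUM_CATEGORIES)]
     for a in range(NUM_CATEGORIES)]


def bilinear_score(x_prev, x_cur, weights):
    """Compute x_prev^T W x_cur, the kind of bilinear score used in attention."""
    return sum(x_prev[a] * weights[a][b] * x_cur[b]
               for a in range(len(x_prev))
               for b in range(len(x_cur)))


def predict_pairs(categories):
    """Return Y_i = score(X_{i-1}, X_i) for every adjacent pair in the sequence."""
    return [bilinear_score(one_hot(categories[i - 1]), one_hot(categories[i]), W)
            for i in range(1, len(categories))]


sequence = [0, 2, 1, 1]
predictions = predict_pairs(sequence)
expected = [q_true(sequence[i - 1], sequence[i]) for i in range(1, len(sequence))]
```

In this sketch the weights are set by hand; whether gradient descent finds such weights on its own is the question the paper studies.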

Paper Structure (15 sections, 215 equations, 3 figures)