Theoretical Analysis of Hierarchical Language Recognition and Generation by Transformers without Positional Encoding

Daichi Hayakawa, Issei Sato

TL;DR

We demonstrate that causal masking and a starting token enable Transformers to compute positional information and depth within hierarchical structures, and that Transformers without positional encoding can generate hierarchical languages.

Abstract

In this study, we provide a constructive proof that Transformers can recognize and generate hierarchical languages efficiently with respect to model size, even without a specific positional encoding. Specifically, we show that causal masking and a starting token enable Transformers to compute positional information and depth within hierarchical structures. We demonstrate that Transformers without positional encoding can generate hierarchical languages. Furthermore, we suggest that explicit positional encoding might have a detrimental effect on generalization with respect to sequence length.
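To make the role of causal masking and the starting token concrete, the following is a minimal sketch, not the paper's construction. It assumes a causal attention head whose scores are all equal, so position $i$ simply averages the values at positions $0, \dots, i$. If only the starting token carries value 1, the average is $1/(i+1)$, which encodes the position; if open and closed brackets carry $+1$ and $-1$, the average is the depth divided by $i+1$. The token encoding (`"(2"`, `")2"`), the function names, and the final division step are illustrative assumptions.

```python
from fractions import Fraction

BOS = "<bos>"


def uniform_causal_average(values):
    """Output of a causal head with equal scores: position i averages values[0..i]."""
    out, running = [], Fraction(0)
    for i, v in enumerate(values):
        running += v
        out.append(running / (i + 1))
    return out


def position_and_depth(tokens):
    # Hypothetical encoding: open brackets start with "(", closed ones with ")".
    is_bos = [Fraction(1 if t == BOS else 0) for t in tokens]
    sign = [
        Fraction(1 if t.startswith("(") else -1 if t.startswith(")") else 0)
        for t in tokens
    ]
    inv_len = uniform_causal_average(is_bos)  # equals 1 / (i + 1)
    avg_sign = uniform_causal_average(sign)   # equals depth_i / (i + 1)
    # Recover integers from the two averages (done exactly here with Fractions).
    positions = [int(1 / a) - 1 for a in inv_len]
    depths = [int(s / a) for s, a in zip(avg_sign, inv_len)]
    return positions, depths


tokens = [BOS, "(2", "(1", ")1", ")2", "(3", ")3"]
positions, depths = position_and_depth(tokens)
print(positions)
print(depths)
```

Without the starting token there would be no fixed reference value to average against, so the quantity $1/(i+1)$ would not be available; this is the intuition the sketch is meant to convey, under the stated uniform-attention assumption.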

Paper Structure

This paper contains 68 sections, 32 theorems, 238 equations, 9 figures, 5 tables.

Key Result

Proposition 1

For any language $\mathcal{L} \subset \Sigma^*$ over a finite alphabet $\Sigma$ and any probability distribution $p$ over $\mathcal{L}$, there exists a language generation process that produces the given probability distribution $p$. In other words, there exists a language generation process ...
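One way to see why a statement of this kind can hold is the chain rule of probability. The sketch below assumes that a language generation process is an autoregressive next-token distribution with an end-of-sequence symbol; the paper's Definition 6 is not reproduced here, so this reading is an assumption. It also handles only a distribution with finite support, whereas the proposition covers any distribution. The names `next_token_distribution`, `sequence_probability`, and `EOS` are hypothetical.

```python
from fractions import Fraction

EOS = "<eos>"


def next_token_distribution(p, prefix):
    """Conditional distribution of the next symbol (or EOS) given a prefix, derived from p."""
    mass = {}
    for w, prob in p.items():
        if w[: len(prefix)] == prefix:
            nxt = w[len(prefix)] if len(w) > len(prefix) else EOS
            mass[nxt] = mass.get(nxt, 0) + prob
    total = sum(mass.values())
    # Empty when the prefix has zero probability under p.
    return {tok: m / total for tok, m in mass.items()}


def sequence_probability(p, w):
    """Probability that the autoregressive process emits w followed by EOS."""
    prob = Fraction(1)
    for t in range(len(w) + 1):
        dist = next_token_distribution(p, w[:t])
        tok = w[t] if t < len(w) else EOS
        prob *= dist.get(tok, 0)
    return prob


# A toy distribution over three balanced strings (illustrative values).
p = {"()": Fraction(1, 2), "(())": Fraction(1, 4), "()()": Fraction(1, 4)}
for w, target in p.items():
    print(w, sequence_probability(p, w), target)
```

In this toy case the product of the conditional probabilities reproduces $p$ exactly on every string in the support and assigns probability 0 to strings outside it.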

Figures (9)

  • Figure 1: (Left) Test accuracy of generating the correct closed brackets on $\texttt{Dyck}_8$. (Right) Test accuracy of generating the correct closed bracket on $\texttt{Shuffle-Dyck}_8$. The solid lines represent the results for in-distribution data ($n \leq 700$), while the dashed lines represent the results for out-of-distribution data ($700 < n \leq 840$). In both experiments, results are averaged over $5$ runs with different random seeds.
  • Figure 2: An example of a string that belongs to $\texttt{Shuffle-Dyck}_3$ but not to $\texttt{Dyck}_3$. Each substring of type $t \in \{1, 2, 3\}$ is properly balanced.
  • Figure 3: Illustration of the process where the query $"\rangle_3"$ in the input string $"\texttt{<bos>} \langle_2 \langle_1 \rangle_1 \rangle_2 \langle_3 \rangle_3"$ fetches the nearest depth-matched open bracket $"\langle_3"$. At first, using $\operatorname{T}^{\mathrm{depth}}$, only the depth-matched open brackets and $\texttt{<bos>}$ are extracted, and then, using $\operatorname{T}^{\mathrm{pos}}$, the nearest one among them is extracted.
  • Figure 4: Illustration of the recovering function.
  • Figure 5: Test accuracy over $5$ runs of generating the correct closed brackets on $\texttt{Dyck}_k$ ($k \in \{1, 2, 4, 8, 16\}$). The solid lines represent the results for in-distribution data ($n \leq 700$), while the dashed lines represent the results for out-of-distribution data ($700 < n \leq 840$).
  • ...and 4 more figures
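The two-step lookup described in the Figure 3 caption can be imitated in plain Python. This is a sketch of the selection logic only, not of the attention layers $\operatorname{T}^{\mathrm{depth}}$ and $\operatorname{T}^{\mathrm{pos}}$ themselves. It assumes that the depth of a token is the number of unclosed brackets after reading it, so that a closed bracket at depth $d$ matches an open bracket at depth $d + 1$; the token encoding and function names are hypothetical.

```python
BOS = "<bos>"


def running_depth(tokens):
    """Number of unclosed brackets after reading each token."""
    depth, out = 0, []
    for t in tokens:
        if t.startswith("("):
            depth += 1
        elif t.startswith(")"):
            depth -= 1
        out.append(depth)
    return out


def fetch_matching_open(tokens, query_index):
    """Index of the nearest depth-matched open bracket (or of <bos> if none exists)."""
    depths = running_depth(tokens)
    target = depths[query_index] + 1
    # Step 1 (depth filter): keep <bos> and open brackets at the matching depth.
    candidates = [
        j
        for j in range(query_index)
        if tokens[j] == BOS or (tokens[j].startswith("(") and depths[j] == target)
    ]
    # Step 2 (position filter): among those, take the one closest to the query.
    return max(candidates)


# The input string from Figure 3, with "(t" / ")t" standing for brackets of type t.
tokens = [BOS, "(2", "(1", ")1", ")2", "(3", ")3"]
match = fetch_matching_open(tokens, 6)
print(match, tokens[match])
```

Keeping `<bos>` among the candidates means the lookup always returns something, even when no open bracket matches.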

Theorems & Definitions (75)

  • Definition 1: $\texttt{Dyck}_k$ language for language models
  • Definition 2: $\texttt{Shuffle-Dyck}_k$ language for language models (informal)
  • Definition 3: Prefix for language
  • Definition 4: Depth of string
  • Definition 5: Language recognition by Transformers
  • Definition 6: Language generation process
  • Proposition 1
  • proof
  • Definition 7: Realization of language generation process by Transformers
  • Definition 8: $\texttt{Dyck}_k$ language generation process
  • ...and 65 more
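The difference between the two languages in Definitions 1 and 2 can be illustrated with simple membership checks. The sketch below follows the informal description in the Figure 2 caption (every bracket type is balanced on its own in $\texttt{Shuffle-Dyck}_k$) and the usual stack-based notion of $\texttt{Dyck}_k$. The paper's definitions "for language models" may add details, such as special tokens, that are not modeled here. The `(kind, type)` token representation is an assumption made for illustration.

```python
def is_dyck(tokens):
    """Brackets are balanced and properly nested across all types (stack check)."""
    stack = []
    for kind, bracket_type in tokens:
        if kind == "open":
            stack.append(bracket_type)
        elif not stack or stack.pop() != bracket_type:
            return False
    return not stack


def is_shuffle_dyck(tokens):
    """Each bracket type is balanced on its own; types may interleave freely."""
    counters = {}
    for kind, bracket_type in tokens:
        delta = 1 if kind == "open" else -1
        counters[bracket_type] = counters.get(bracket_type, 0) + delta
        if counters[bracket_type] < 0:
            return False
    return all(count == 0 for count in counters.values())


# Interleaved types: balanced per type, but not properly nested.
crossed = [("open", 1), ("open", 2), ("close", 1), ("close", 2)]
nested = [("open", 1), ("open", 2), ("close", 2), ("close", 1)]
print(is_dyck(crossed), is_shuffle_dyck(crossed))
print(is_dyck(nested), is_shuffle_dyck(nested))
```

The stack check needs to remember the order of open brackets, while the per-type check only needs one counter per bracket type.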