Simulating Weighted Automata over Sequences and Trees with Transformers

Michael Rizvi; Maude Lizaire; Clara Lacroce; Guillaume Rabusseau

TL;DR

The paper proves that transformers can exactly simulate weighted finite automata (WFAs) over sequences and approximately simulate weighted tree automata (WTAs) over trees. For WFAs, hard attention and bilinear layers give a depth that is logarithmic in the input length; for WTAs, the depth is linear in the depth of the tree (logarithmic depth is achievable on balanced trees). The paper states the required architectural parameters precisely and also proves that standard transformer components (soft attention and MLPs) can approximate WFAs with depth $\mathcal{O}(\log T)$ and width $\mathcal{O}(n^4)$, where $T$ is the input length and $n$ the number of states, independently of the desired precision $\epsilon$. Empirically, gradient-based training on synthetic data shows that transformers can learn these compact solutions: the required layer count and embedding size scale broadly in line with the theory, although some hyperparameter tuning is needed. The work extends prior DFA-focused results to the broader classes of WFAs and WTAs and suggests that transformers can act as efficient computational shortcuts for complex sequential and hierarchical reasoning tasks, which bears on the computational limits of transformer-based models.

Abstract

Transformers are ubiquitous models in the natural language processing (NLP) community and have shown impressive empirical successes in the past few years. However, little is understood about how they reason and the limits of their computational capabilities. These models do not process data sequentially, and yet outperform sequential neural models such as RNNs. Recent work has shown that these models can compactly simulate the sequential reasoning abilities of deterministic finite automata (DFAs). This leads to the following question: can transformers simulate the reasoning of more complex finite state machines? In this work, we show that transformers can simulate weighted finite automata (WFAs), a class of models which subsumes DFAs, as well as weighted tree automata (WTAs), a generalization of weighted automata to tree-structured inputs. We prove these claims formally and provide upper bounds on the sizes of the transformer models needed as a function of the number of states of the target automata. Empirically, we perform synthetic experiments showing that transformers are able to learn these compact solutions via standard gradient-based training.
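To make the objects in the abstract concrete, here is a minimal sketch of how a WFA assigns a value to a word, assuming the usual parameterization by an initial vector, one transition matrix per symbol, and a final vector, so that $f(w) = \alpha^\top A_{w_1} \cdots A_{w_T} \beta$. The names `wfa_value`, `PARITY_B` and `COUNT_B` and both example automata are illustrative assumptions, not taken from the paper. The first automaton is a DFA written with 0/1 weights, which illustrates how WFAs subsume DFAs.

```python
def vec_mat(v, M):
    """Multiply a row vector v by a matrix M (both plain Python lists)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def wfa_value(alpha, transitions, beta, word):
    """Compute f(w) = alpha^T A[w_1] ... A[w_T] beta, one symbol at a time."""
    state = list(alpha)
    for symbol in word:
        state = vec_mat(state, transitions[symbol])
    return sum(s * b for s, b in zip(state, beta))


# Hypothetical example 1: a DFA as a WFA with 0/1 weights.
# It outputs 1.0 when the word contains an even number of 'b' and 0.0 otherwise.
PARITY_B = {
    "alpha": [1.0, 0.0],
    "beta": [1.0, 0.0],
    "transitions": {
        "a": [[1.0, 0.0], [0.0, 1.0]],  # 'a' keeps the current state
        "b": [[0.0, 1.0], [1.0, 0.0]],  # 'b' swaps the even and odd states
    },
}

# Hypothetical example 2: a genuinely weighted automaton counting occurrences of 'b'.
COUNT_B = {
    "alpha": [1.0, 0.0],
    "beta": [0.0, 1.0],
    "transitions": {
        "a": [[1.0, 0.0], [0.0, 1.0]],
        "b": [[1.0, 1.0], [0.0, 1.0]],  # adds 1 to the running count
    },
}

print(wfa_value(word="abba", **PARITY_B))  # 1.0 (two 'b', an even number)
print(wfa_value(word="abba", **COUNT_B))   # 2.0
```

The loop in `wfa_value` is inherently sequential: it takes one step per input symbol. The question raised in the abstract is whether a transformer, which does not process the input step by step, can reproduce the same values.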

Paper Structure (52 sections, 6 theorems, 36 equations, 7 figures, 4 tables, 1 algorithm)

INTRODUCTION
PRELIMINARIES
Notation
Weighted Finite Automata
Weighted Tree Automata
Transformers
Bilinear Layers
SIMULATING WEIGHTED AUTOMATA OVER SEQUENCES
Simulation Definition
Main theorems
SIMULATING WEIGHTED TREE AUTOMATA
Simulation definition
Results
EXPERIMENTS
Can logarithmic solutions be found?
...and 37 more sections

Key Result

Theorem 1

Transformers using bilinear layers in place of an MLP and hard attention can exactly simulate all WFAs with $n$ states at length $T$, with depth $\mathcal{O}(\log T)$, embedding dimension $\mathcal{O}(n^2)$, attention width $\mathcal{O}(n^2)$, MLP width $\mathcal{O}(n^2)$ and $\mathcal{O}(1)$ attent
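One way to build intuition for the $\mathcal{O}(\log T)$ depth is the classic recursive-doubling scan, in which every position combines its partial product with the one `offset` positions to its left and `offset` doubles each round. The sketch below applies that generic scan to per-symbol transition matrices; it is an illustrative assumption about why logarithmically many rounds can suffice, not the paper's construction, which realizes the computation with attention and bilinear layers. The function names and the example matrices are hypothetical.

```python
def mat_mul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def prefix_products(mats):
    """Return all prefix products mats[0] @ ... @ mats[t] and the number of rounds used.

    Within a round, every position is updated from the previous round's values
    only, so each round could run in parallel across positions.
    """
    prefix = list(mats)
    offset, rounds = 1, 0
    while offset < len(prefix):
        prefix = [
            prefix[t] if t < offset else mat_mul(prefix[t - offset], prefix[t])
            for t in range(len(prefix))
        ]
        offset *= 2
        rounds += 1
    return prefix, rounds


# Hypothetical 2-state transition matrices: 'a' is the identity, 'b' adds 1 to a counter.
A = {"a": [[1, 0], [0, 1]], "b": [[1, 1], [0, 1]]}

products, rounds = prefix_products([A[s] for s in "abba"])
print(rounds)        # 2 rounds for T = 4
print(products[-1])  # [[1, 2], [0, 1]], the product A_a A_b A_b A_a
```

Matrix multiplication is associative but not commutative, so the scan keeps the left-to-right order of the factors in every combination step.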

Figures (7)

Figure 1: Simulation of the WFA computation over the input $w=abba$ with a transformer.
Figure 2: Computation of a WTA on the input tree $t=(a,((b,b),b))$ (left) and simulation of the WTA computation over $t$ with a transformer (right).
Figure 3: Average MSE vs. number of layers: For all considered sequence lengths, adding layers has a notable effect on the MSE at first; past a certain point, however, the improvement is negligible. This stabilization is consistent with our theoretical results (shown as dotted lines).
Figure 4: Average MSE (log scale) vs. embedding size: Increasing the embedding size also has a notable effect on the MSE. However, the stabilization of the curves does not agree as closely with our theoretical results (shown as dotted lines).
Figure 5: Illustration of the prefix sum algorithm
...and 2 more figures
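The WTA computation referred to in Figure 2 can be sketched as a bottom-up recursion over a binary tree. The parameterization below is an assumption made for illustration: one vector per leaf symbol, a third-order tensor `T` that combines the vectors of two subtrees, and a final vector `alpha`. The example automaton, which counts the leaves labelled `b`, is hypothetical and is not the automaton shown in the figure.

```python
def wta_embed(tree, T, leaf_vectors):
    """Bottom-up state vector of a binary tree.

    A tree is either a leaf symbol (str) or a pair (left, right).
    For an internal node: mu_i = sum over j, k of T[i][j][k] * mu(left)_j * mu(right)_k.
    """
    if isinstance(tree, str):
        return list(leaf_vectors[tree])
    left, right = tree
    u = wta_embed(left, T, leaf_vectors)
    v = wta_embed(right, T, leaf_vectors)
    n = len(T)
    return [
        sum(T[i][j][k] * u[j] * v[k] for j in range(n) for k in range(n))
        for i in range(n)
    ]


def wta_value(tree, alpha, T, leaf_vectors):
    """Compute f(t) = alpha^T mu(t)."""
    mu = wta_embed(tree, T, leaf_vectors)
    return sum(a * m for a, m in zip(alpha, mu))


# Hypothetical 2-state WTA whose state vector is [1, number of 'b' leaves].
LEAF = {"a": [1, 0], "b": [1, 1]}
T = [
    [[1, 0], [0, 0]],  # component 0: stays equal to 1
    [[0, 1], [1, 0]],  # component 1: left count + right count
]
ALPHA = [0, 1]

t = ("a", (("b", "b"), "b"))  # the input tree t = (a, ((b, b), b)) from Figure 2
print(wta_value(t, ALPHA, T, LEAF))  # 3
```

Each internal node needs the vectors of both children before it can be evaluated, so the number of sequential steps in this recursion grows with the depth of the tree.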

Theorems & Definitions (22)

Definition 2.1
Definition 2.2
Definition 2.3
Definition 2.4
Definition 3.1
Definition 3.2
Definition 3.3
Theorem 1
Theorem 2
Definition 4.1
...and 12 more
