Automata Extraction from Transformers

Yihao Zhang; Zeming Wei; Meng Sun

TL;DR

This work tackles the interpretability of Transformer-based encoders by proposing a fully automated pipeline that extracts deterministic finite automata (DFA) from encoder-only Transformers. It introduces a deterministic continuous-state automaton (DCSA) as an intermediate abstraction, built on the representation space of the Transformer, and then applies the L$^*$ algorithm to obtain a DFA. A representation-alignment objective ties Transformer representations to the DCSA states, enabling faithful DFA extraction across regular languages, with evaluation on Tomita benchmarks and related formal-language datasets. Results show near-perfect DFA consistency for learnable languages, stable generalization across architectures, and valuable insights into Transformer behavior on formal languages, including when memorization dominates generalization. This approach provides a concrete, interpretable view into how Transformer models process formal languages and lays groundwork for deeper mechanistic analyses of stateful behavior in such architectures.

Abstract

In modern machine learning (ML) systems, Transformer-based architectures have achieved milestone success across a broad spectrum of tasks, yet understanding their operational mechanisms remains an open problem. To improve the transparency of ML systems, automata extraction methods, which interpret stateful ML models as automata, typically through formal languages, have proven effective for explaining the mechanism of recurrent neural networks (RNNs). However, few works have applied this paradigm to Transformer models. In particular, how these models process formal languages, and where their limitations in this area lie, remains unexplored. In this paper, we propose an automata extraction algorithm specifically designed for Transformer models. Treating the Transformer model as a black-box system, we track how its internal latent representations are transformed as it operates, and then use classical pedagogical approaches such as the L* algorithm to interpret the model as a deterministic finite-state automaton (DFA). Overall, our study reveals how the Transformer model comprehends the structure of formal languages, which not only enhances the interpretability of Transformer-based ML systems but also marks a crucial step toward a deeper understanding of how ML systems process formal languages. Code and data are available at https://github.com/Zhang-Yihao/Transfomer2DFA.
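To make the idea of comparing an extracted DFA with a black-box model concrete, the following is a minimal sketch, not the authors' implementation. The `DFA` class, the `consistency_rate` helper, the exhaustive enumeration up to `max_len`, and the toy parity classifier standing in for a trained Transformer are all assumptions introduced for illustration; the exact evaluation protocol used in the paper is not stated in this section.

```python
from itertools import product


class DFA:
    """Hypothetical minimal DFA: delta maps (state, symbol) -> state."""

    def __init__(self, start, accepting, delta):
        self.start = start
        self.accepting = set(accepting)
        self.delta = delta

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.delta[(state, symbol)]
        return state in self.accepting


def all_words(alphabet, max_len):
    """Yield every word over `alphabet` with length 0..max_len."""
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def consistency_rate(dfa, black_box, alphabet="01", max_len=8):
    """Fraction of enumerated words on which the DFA and the model agree."""
    words = list(all_words(alphabet, max_len))
    agree = sum(dfa.accepts(w) == black_box(w) for w in words)
    return agree / len(words)


def toy_model(word):
    """Stand-in for a trained classifier: accepts an even number of 1s."""
    return word.count("1") % 2 == 0


# A two-state DFA for the same parity language as the toy model.
parity_dfa = DFA(
    start="even",
    accepting={"even"},
    delta={
        ("even", "0"): "even",
        ("even", "1"): "odd",
        ("odd", "0"): "odd",
        ("odd", "1"): "even",
    },
)

rate = consistency_rate(parity_dfa, toy_model)
```

In practice the black-box callable would wrap the trained model's accept/reject decision, and words would likely be sampled rather than enumerated once the length bound grows.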

Paper Structure (13 sections, 7 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Preliminaries
Methods
Motivations
Algorithm Design
Experimental Settings
Experimental Results and Analysis
Effectiveness (RQ1)
Generalization (RQ2)
Explainability (RQ3)
Ablation (RQ4)
Related Work
Conclusion

Figures (5)

Figure 1: An illustration of the overall extraction pipeline.
Figure 2: Automaton extracted from a Transformer trained on the Tomita 3 language, which is exactly the complement of $((0|1)^{*}0)^{*}1(11)^{*}(0(0|1)^{*}1)^{*}0(00)^{*}(1(0|1)^{*})^{*}$ over the alphabet $\{0,1\}$. The extraction result is largely correct, with a consistency rate over 97%.
Figure 3: Automaton extracted from a Transformer trained on the Tomita 4 language, which consists of all words in $\{0,1\}^*$ that do not contain "000". The extraction result is completely correct.
Figure 4: Automaton extracted from a Transformer trained on the $\mathcal{D}_4$ language. $\mathcal{D}_n$ is defined recursively as a regular language, where $\mathcal{D}_0 = \varepsilon$ and $\mathcal{D}_n = (0 \mathcal{D}_{n-1} 1)^*$. The extraction result is completely correct.
Figure 5: Automaton extracted from a Transformer trained on the Mod 3 language, which consists of all binary numbers divisible by 3.
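
The language definitions in the captions above can be written as small reference predicates, which is one way to label data or check an extracted automaton. This is an illustrative sketch only: the function names are hypothetical, the bounded-depth reading of $\mathcal{D}_n$ is derived here from the recursive definition rather than taken from the paper, and treating the empty word as a member of Mod 3 is an assumption.

```python
import re


def tomita4(word):
    """Tomita 4: words over {0,1} that do not contain '000'."""
    return "000" not in word


def mod3(word):
    """Mod 3: binary numerals whose value is divisible by 3.

    Reading bit b updates the remainder r to (2*r + b) % 3.
    Assumption: the empty word (remainder 0) is accepted.
    """
    remainder = 0
    for bit in word:
        remainder = (2 * remainder + int(bit)) % 3
    return remainder == 0


def in_d_n(word, n):
    """D_n read as balanced 0/1 words with nesting depth at most n."""
    depth = 0
    for symbol in word:
        if symbol == "0":
            depth += 1
            if depth > n:
                return False
        elif symbol == "1":
            depth -= 1
            if depth < 0:
                return False
        else:
            return False
    return depth == 0


def d_n_regex(n):
    """Regex built directly from D_0 = epsilon, D_n = (0 D_{n-1} 1)*."""
    if n == 0:
        return ""
    return "(?:0" + d_n_regex(n - 1) + "1)*"


def in_d_n_by_regex(word, n):
    """Cross-check of in_d_n using the recursive regular expression."""
    return re.fullmatch(d_n_regex(n), word) is not None
```

The counter-based `in_d_n` and the regex-based `in_d_n_by_regex` should agree on every word, which gives a quick sanity check of the bounded-depth interpretation.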
